Preface



The message passing paradigm is the most frequently used approach to develop 
high-performance computing applications on parallel and distributed computing 
architectures. Parallel Virtual Machine (PVM) and Message Passing Interface 
(MPI) are the two main representatives in this domain. 

This volume comprises 50 selected contributions presented at the 11th
European PVM/MPI Users’ Group Meeting, which was held in Budapest,
Hungary, September 19-22, 2004. The conference was organized by the Laboratory
of Parallel and Distributed Systems (LPDS) at the Computer and Automation 
Research Institute of the Hungarian Academy of Sciences (MTA SZTAKI). 

The conference was previously held in Venice, Italy (2003), Linz, Austria 
(2002), Santorini, Greece (2001), Balatonfüred, Hungary (2000), Barcelona,
Spain (1999), Liverpool, UK (1998), and Krakow, Poland (1997). The first three 
conferences were devoted to PVM and were held in Munich, Germany (1996), 
Lyon, France (1995), and Rome, Italy (1994). 

In its eleventh year, this conference is well established as the forum for users 
and developers of PVM, MPI, and other message passing environments. Interac- 
tions between these groups have proved to be very useful for developing new ideas 
in parallel computing, and for applying some of those already existent to new 
practical fields. The main topics of the meeting were evaluation and performance 
of PVM and MPI, extensions, implementations and improvements of PVM and 
MPI, parallel algorithms using the message passing paradigm, and parallel ap- 
plications in science and engineering. In addition, the topics of the conference 
were extended to include cluster and grid computing, in order to reflect the 
importance of this area for the high-performance computing community. 

Besides the main track of contributed papers, the conference featured the 
third edition of the special session “ParSim 04 - Current Trends in Numerical 
Simulation for Parallel Engineering Environments”. The conference also included
three tutorials, one on “Using MPI-2: A Problem-Based Approach”, one on
“Interactive Applications on the Grid - the CrossGrid Tutorial”, and another
one on “Production Grid Systems and Their Programming”, and invited talks
on MPI and high-productivity programming, fault tolerance in message passing
and in action, high-performance application execution scenarios in P-GRADE, 
an open cluster system software stack, from PVM grids to self-assembling virtual 
machines, the grid middleware of the NorduGrid, next-generation grids, and 
the Austrian Grid initiative - high-level extensions to grid middleware. These 
proceedings contain papers on the 50 contributed presentations together with 
abstracts of the invited and tutorial speakers’ presentations. 

The 11th Euro PVM/MPI conference was held together with DAPSYS 2004, 
the 5th Austrian-Hungarian Workshop on Distributed and Parallel Systems.
Participants of the two events shared invited talks, tutorials, the vendors’ session, 
and social events, while contributed paper presentations proceeded in separate 
tracks in parallel. While Euro PVM/MPI is dedicated to the latest developments
of PVM and MPI, DAPSYS was a major event to discuss general aspects of 
distributed and parallel systems. In this way the two events were complementary 
to each other and participants of Euro PVM/MPI could benefit from the joint 
organization of the two events. 

The invited speakers of the joint Euro PVM/MPI and DAPSYS conference 
were Jack Dongarra, Gabor Dozsa, Al Geist, William Gropp, Balazs Konya,
Domenico Laforenza, Ewing Lusk, and Jens Volkert. The tutorials were presented 
by William Gropp and Ewing Lusk, Tomasz Szepieniec, Marcin Radecki and 
Katarzyna Rycerz, and Peter Kacsuk, Balazs Konya, and Peter Stefan.

We express our gratitude for the kind support of our sponsors (see below) and 
we thank the members of the Program Committee and the additional reviewers 
for their work in refereeing the submitted papers and ensuring the high quality of 
Euro PVM/MPI. Finally, we would like to express our gratitude to our colleagues 
at MTA SZTAKI and GUP, JKU Linz for their help and support during the 
conference organization. 
Abstract. Oak Ridge National Laboratory (ORNL) leads two of the
five big Genomes-to-Life projects funded in the USA. As a part of these 
projects researchers at ORNL have been using PVM to build a compu- 
tational biology grid that spans the USA. This talk will describe this 
effort, how it is built, and the unique features in PVM that led the re- 
searchers to choose PVM as their framework. The computations such as 
parallel BLAST are run on individual supercomputers or clusters within 
this P2P grid and are themselves written in PVM to exploit PVM’s fault 
tolerant capabilities. 

We will then describe our recent progress in building an even more adapt- 
able distributed virtual machine package called Harness. The Harness 
project includes research on a scalable, self-adapting core called H2O,
and research on fault tolerant MPI. The Harness software framework provides
parallel software “plug-ins” that adapt the run-time system to changing 
application needs in real time. This past year we have demonstrated Har- 
ness’ ability to self-assemble into a virtual machine specifically tailored 
for particular applications. 

Finally we will describe DOE’s plan to create a National Leadership 
Computing Facility, which will house a 100 TF Cray X2 system, and a 
Cray Red Storm at ORNL, and an IBM Blue Gene system at Argonne 
National Lab. We will describe the scientific missions of this facility and 
the new concept of “computational end stations” being pioneered by the 
Facility. 



1 Genomics Grid Built on PVM 

The United States Department of Energy (DOE) has embarked on an ambi- 
tious computational biology program called Genomes to Life [1] . The program is 
using DNA sequences from microbes and higher organisms, for systematically 
tackling questions about the molecular machines and regulatory pathways of 
living systems. Advanced technological and computational resources are being 
employed to identify and understand the underlying mechanisms that enable 
living organisms to develop and survive under a wide variety of environmental 
conditions. 

ORNL is a leader in two of the five Genomes to Life centers. As part of this 
effort ORNL is building a Genomics Computational Grid across the U.S. con- 
necting ORNL, Argonne National Lab, Pacific Northwest National Lab, and 
Lawrence Berkeley National Lab. The software being deployed is called The
Genome Channel [2] . The Genome Channel is a computational biology workbench 
that allows biologists to run a wide range of genome analysis and comparison 
studies transparently on resources at the grid sites. Genome Channel is built on 
top of the PVM software. When a request comes in to the Genome Channel,
PVM is used to track the request, create a parallel virtual machine combining
database servers, Linux clusters, and supercomputer nodes tailored to the nature
of the request, spawn the appropriate analysis code on the virtual machine,
and then return the results.

The creators of this Genomics Grid require that their system be available 
24/7 and that analyses that are running when a failure occurs are reconfigured 
around the problem and automatically restarted. PVM’s dynamic programming 
model and fault tolerance features are ideal for this use. The Genome Channel
has been cited in “Science” and used by thousands of researchers from around 
the world. 

2 Harness: Self-assembling Virtual Machine 

Harness [3] is the code name for the next generation heterogeneous distributed 
computing package being developed by the PVM team at Oak Ridge National 
Laboratory, the University of Tennessee, and Emory University. The basic idea 
behind Harness is to allow users to dynamically customize, adapt, and extend a 
virtual machine’s features to more closely match the needs of their application 
and to optimize the virtual machine for the underlying computer resources, for 
example, taking advantage of a high-speed I/O. As part of the Harness project, 
the University of Tennessee is developing a fault tolerant MPI called FT-MPI. 
Emory has taken the lead in the architectural design of Harness and development 
of the H2O core [4].

Harness was envisioned as a research platform for investigating the concepts 
of parallel plug-ins, distributed peer-to-peer control, and merging and splitting 
of multiple virtual machines. The parallel plug-in concept provides a way for 
a heterogeneous distributed machine to take on a new capability, or replace an 
existing capability with a new method across the entire virtual machine. Parallel 
plug-ins are also the means for a Harness virtual machine to self-assemble. The 
peer-to-peer control eliminates all single points of failure and even multiple points 
of failure. The merging and splitting of Harness virtual machines provides a 
means of self healing. 

The project has made good progress this past year and we have demonstrated 
the capability for a self-assembling virtual machine with capabilities tuned to 
the needs of a particular chemistry application. The second part of this talk will 
describe the latest Harness progress and results. 

3 DOE National Leadership Computing Facility 

In May of 2004 it was announced that Oak Ridge National Laboratory (ORNL) 
had been chosen to provide the USA’s most powerful open resource for capability 
computing, and we propose a sustainable path that will maintain and extend 
national leadership for the DOE Office of Science in this critical area. 

The effort is called the National Leadership Computing Facility (NLCF) 
and engages a world-class team of partners from national laboratories, research 
institutions, computing centers, universities, and vendors to take a dramatic step 
forward to field a new capability for high-end science. Our team offers the Office 
of Science an aggressive deployment plan, using technology designed to maximize 
the performance of scientific applications, and a means of engaging the scientific 
and engineering community. 

The NLCF will immediately double the capability of the existing Cray X1
at ORNL and further upgrade it to a 20TF Cray X1E in 2004. The NLCF will
maintain national leadership in scientific computing by installing a 100TF Cray 
X2 in 2006. We will simultaneously conduct an in-depth exploration of alterna- 
tive technologies for next-generation leadership-class computers by deploying a 
20TF Cray Red Storm at ORNL and a 50TF IBM BlueGene/L at Argonne Na- 
tional Laboratory. These efforts will set the stage for deployment of a machine 
capable of 100TF sustained performance (300TF peak) by 2007. 

NLCF has a comparably ambitious approach to achieving a high level of 
scientific productivity. The NLCF computing system will be a unique world-class
research resource, similar to other large-scale experimental facilities constructed
and operated around the world. At these facilities, scientists and engineers
make use of “end stations” (best-in-class instruments supported by instrument
specialists) that enable the most effective use of the unique capabilities of
the facilities. In similar fashion the NLCF will have Computational End Sta- 
tions (CESs) that offer access to best-in-class scientific application codes and 
world-class computational specialists. The CESs will engage multi-institutional, 
multidisciplinary teams undertaking scientific and engineering problems that can 
only be solved on the NLCF computers and who are willing to enhance the ca- 
pabilities of the NLCF and contribute to its effective operation. All CESs will be 
selected through a competitive peer-review process. It is envisioned that there 
will be computational end stations in climate, fusion, astrophysics, nanoscience, 
chemistry, and biology as these offer great potential for breakthrough science in 
the near term. 

The last part of this talk describes how the NLCF will bring together world- 
class researchers; a proven, aggressive, and sustainable hardware path; an experi- 
enced operational team; a strategy for delivering true capability computing; and 
modern computing facilities connected to the national infrastructure through 
state-of-the-art networking to deliver breakthrough science. Combining these re- 
sources and building on expertise and resources of the partnership, the NLCF 
will enable scientific computation at an unprecedented scale. 
Abstract. The Austrian Grid initiative is the national effort of Austria to establish
a nation-wide grid environment for computational science and research. The
goal of the Austrian Grid is to support a variety of scientific users in utilizing the
Grid for their applications, e.g. medical sciences, high-energy physics, applied
numerical simulations, astrophysical simulations and solar observations, as well as
meteorology and geophysics. All these applications rely on a wide range of diverse
computer science technologies, composed from standard grid middleware
and sophisticated high-level extensions, which enable the implementation and
operation of the Austrian Grid testbed and its applications.

One of these high-level middleware extensions is the Grid Visualization Kernel 
(GVK), which offers a means to process and transport large amounts of visual- 
ization data on the grid. In particular, GVK addresses the connection of grid ap- 
plications and visualization clients on the grid. The visualization capabilities of 
GVK are provided as flexible grid services via dedicated interfaces and protocols, 
while GVK itself relies on the grid to implement and improve the functionality 
and the performance of the visualization pipeline. As a result, users are able to ex- 
ploit visualization within their grid applications similar to how they would utilize 
visualization on their desktop workstations. 
Abstract. This talk will describe an implementation of MPI which extends the
message passing model to allow for recovery in the presence of a faulty process. 
Our implementation allows a user to catch the fault and then provide for a recov- 
ery. 

We will also touch on the issues related to using diskless checkpointing to allow 
for effective recovery of an application in the presence of a process fault. 
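
To make the idea of diskless checkpointing more concrete, the following is a minimal Python sketch, not the implementation described in the talk. It assumes a parity-based scheme in which every process keeps its checkpoint in memory and one spare process holds the bitwise XOR of all checkpoints, so that a single lost checkpoint can be rebuilt without touching disk. All names (`make_parity`, `recover`, the sample states) are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """Parity block that a spare process would hold in memory (assumption)."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single lost checkpoint from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory states of four processes (equal length by assumption).
states = [b"rank0-aa", b"rank1-bb", b"rank2-cc", b"rank3-dd"]
parity = make_parity(states)

# Simulate the loss of process 2 and rebuild its state from memory only.
lost = 2
survivors = [s for i, s in enumerate(states) if i != lost]
restored = recover(survivors, parity)
```

A single parity block tolerates only one simultaneous process failure; surviving more failures would need a stronger encoding, which this sketch does not attempt.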
Abstract. MPI has often been called the “assembly language” of parallel
programming. In fact, MPI has succeeded because, like other successful
but low-level programming models, it provides support for both
performance programming and for “programming in the large” - building
support tools, such as software libraries, for large-scale applications.
Nevertheless, MPI programming can be challenging, particularly if ap- 
proached as a replacement for shared-memory style load/store program- 
ming. By looking at some representative programming tasks, this talk 
looks at ways to improve the productivity of parallel programmers by 
identifying the key communities and their needs, the strengths and weak- 
nesses of the MPI programming model and the implementations of MPI, 
and opportunities for improving productivity both through the use of 
tools that leverage MPI and through extensions of MPI. 



* This work was supported by the Mathematical, Information, and Computational Sci- 
ences Division subprogram of the Office of Advanced Scientific Computing Research, 
Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38. 
Abstract. The P-GRADE system provides high level graphical support for
development and execution of high performance message-passing applications.
Originally, execution of such applications was supported on heterogeneous
workstation clusters in interactive mode only. Recently, the system has been
substantially extended towards supporting job execution mode in Grid like execution
environments. As a result, P-GRADE now makes it possible to run the same HPC
application using different message-passing middlewares (PVM, MPI or MPICH-G2)
in either interactive or job execution mode, on various execution resources
controlled by either Condor or Globus-2 Grid middlewares.

In order to support such heterogeneous execution environments, the notion of
execution host has been replaced with the more abstract execution resource in the
P-GRADE context. We distinguish between two basic types of execution resources:
interactive and Grid resources. An interactive resource can be, for instance, a
cluster on which the user has an account and can log in via ssh. In contrast, a
Grid resource can be used to submit jobs (e.g. with the help of Globus-2 GRAM)
but it does not allow interactive user sessions. Each resource may exhibit a
number of facilities that determine the type and level of support P-GRADE can
provide for executing an application on that particular resource. Most notably,
such facilities include the available message-passing infrastructures (e.g. PVM,
MPI) and Grid middlewares (e.g. Condor, GT-2). For instance, P-GRADE can
provide automatic dynamic load-balancing if the parallel application is executed
on a LINUX cluster with PVM installed and interactive access to the cluster is
allowed.
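
The relationship between a resource, its facilities, and the support derived from them can be sketched as a small data structure. This is a hypothetical Python illustration, not P-GRADE code: the class `ExecutionResource`, its fields, and the sample resources are assumptions; only the load-balancing rule itself is taken from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutionResource:
    """Hypothetical model of an execution resource and its facilities."""
    name: str
    interactive: bool                        # True: login allowed; False: job submission only
    os: str = "linux"
    message_passing: frozenset = frozenset()  # e.g. {"PVM", "MPI"}
    grid_middleware: frozenset = frozenset()  # e.g. {"Condor", "GT-2"}


def supports_dynamic_load_balancing(res: ExecutionResource) -> bool:
    """Rule from the text: Linux cluster, PVM installed, interactive access."""
    return res.interactive and res.os == "linux" and "PVM" in res.message_passing


# Two made-up resources: an interactive cluster and a job-only Grid site.
cluster = ExecutionResource("lab-cluster", True, "linux",
                            frozenset({"PVM", "MPI"}))
grid_site = ExecutionResource("globus-site", False, "linux",
                              frozenset({"MPI"}), frozenset({"GT-2"}))
```

Here `supports_dynamic_load_balancing(cluster)` holds while the same check on `grid_site` does not, mirroring the distinction between interactive and Grid resources.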

A resource configuration tool has also been developed for P-GRADE in order to
facilitate the creation of all the necessary configuration files during the installation
process. With the help of this configuration tool, even novice users (e.g. biologists,
chemists) can easily set up a customized execution resource pool containing various
kinds of resources (like desktops, clusters, supercomputers, Condor pools,
Globus-2 sites). Having finished the configuration, applications can be launched
and controlled from P-GRADE in a uniform way by means of a high-level GUI
regardless of the actual executing resource.



* This work was partially supported by the Hungarian Scientific Research Fund (OTKA 
T-042459), Hungarian Ministry of Informatics and Communications (MTA IHM 4671/1/2003) 
and National Research and Technology Office (OMFB-00495/2004). 
Abstract. By “cluster system software,” we mean the software that
turns a collection of individual machines into a powerful resource for a 
wide variety of applications. In this talk we will examine one loosely inte- 
grated collection of open-source cluster system software that includes an 
infrastructure for building component-based systems management tools, 
a collection of components based on this infrastructure that has been 
used for the last year to manage a medium-sized cluster, a scalable 
process-management component in this collection that provides for both 
batch and interactive use, and an MPI-2 implementation together with 
debugging and performance analysis tools that help in developing ad- 
vanced applications. 

The component infrastructure has been developed in the context of the 
Scalable Systems Software SciDAC project, where a number of system 
management tools, developed by various groups, have been tied together 
by a common communication library. The flexible architecture of this 
library allows systems managers to design and implement new systems 
components and even new communication protocols and integrate them
into a collection of existing components. One of the components that has 
been integrated into this system is the MPD process manager; we will 
describe its capabilities. It, in turn, supports the process management 
interface used by MPICH-2, a full-featured MPI-2 implementation, for 
scalable startup, dynamic process functionality in MPI-2, and interac- 
tive debugging. This combination allows significant components of the 
systems software stack to be written in MPI for increased performance 
and scalability. 



* This work was supported by the Mathematical, Information, and Computational Sci- 
ences Division subprogram of the Office of Advanced Scientific Computing Research, 
Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38. 
Abstract. The Advanced Resource Connector (ARC), or the NorduGrid middleware,
is an open source software solution enabling production quality computational
and data Grids. Since the first release (May 2002) the middleware has been
deployed and used in production environments. Emphasis is put on scalability,
stability, reliability and performance of the middleware. A growing number
of grid deployments have chosen ARC as their middleware, thus building one of the
largest production Grids of the world.

The NorduGrid middleware integrates computing resources (commodity comput- 
ing clusters managed by a batch system or standalone workstations) and Storage 
Elements, making them available via a secure common grid layer. ARC provides 
a reliable implementation of the fundamental grid services, such as information 
services, resource discovery and monitoring, job submission and management, 
brokering and data management. 

The middleware builds upon standard open source solutions like OpenLDAP,
OpenSSL, SASL and Globus Toolkit 2 (GT2) libraries. NorduGrid provides
innovative solutions essential for a production quality middleware: Grid Manager,
ARC GridFTP server, information model and providers (NorduGrid schema),
User Interface and broker (a “personal” broker integrated into the user interface),
extended Resource Specification Language (xRSL), and the monitoring system.
ARC solutions are replacements and extensions of the original GT2 services; the
middleware does not use most of the core GT2 services, such as the GRAM, the
GT2 job submission commands, the WUftp-based gridftp server, the gatekeeper,
the job-manager, the GT2 information providers and schemas. Moreover, ARC
extended the RSL and made the Globus MDS functional. ARC is thus much more
than GT2 - it offers its own set of Grid services built upon the GT2 libraries.
Abstract. The first part of this talk will be focused on the Grid Evolution. In fact,
in order to discuss “what is a Next Generation Grid”, it is important to determine
“with respect to what”. Distinct phases in the evolution of Grids are observable.
At the beginning of the 90s, in order to tackle huge scientific problems, in several
important research centers tests were conducted on the cooperative use of
geographically distributed resources, conceived as a single powerful computer.
In 1992, Charlie Catlett and Larry Smarr coined the term “Metacomputing” to
describe this innovative computational approach [1].

The term Grid Computing was introduced by Foster and Kesselman a few years 
later, and in the meantime several other words were used to describe this new
computational approach, such as Heterogeneous Computing, Networked Virtual
Supercomputing, Heterogeneous Supercomputing, Seamless Computing, etc.
Metacomputing could be considered as the 1st generation of Grid Computing, 
some kind of “proto-Grid”. 

The Second Grid Computing generation starts around 2001, when Foster et al.
proposed Grid Computing as an important new field, distinguished from conventional
distributed computing by its focus on large-scale resource sharing, innovative
applications, and, in some cases, high-performance orientation [2].

With the advent of multiple different Grid technologies the creativity of the re- 
search community was further stimulated, and several Grid projects were pro- 
posed worldwide. But soon a new question about how to guarantee interoperabil- 
ity among Grids was raised. In fact, the Grid Community, mainly created around 
the Global Grid Forum (GGF) [3], perceived the real risk that the far-reaching vi- 
sion offered by Grid Computing could be obscured by the lack of interoperability 
standards among the current Grid technologies. 

The marriage of the Web Services technologies [4] with the Second Generation
Grid technology led to the valuable GGF Open Grid Services Architecture
(OGSA) [5], and to the creation of the Grid Service concept and specification
(Open Grid Service Infrastructure - OGSI). OGSA can be considered the milestone
architecture to build Third Generation Grids.

The second part of this talk aims to present the outcome of a group of independent
experts convened by the European Commission with the objective to identify
potential European Research priorities for Next Generation Grid(s) in 2005-2010 [6].
The Next Generation Grid Properties (“The NGG Wish List”) will be
presented. The current Grid implementations do not individually possess all of the
properties reported in the NGG document. However, future Grids not possessing
them are unlikely to be of significant use and, therefore, inadequate from both
research and commercial perspectives. In order to realise the NGG vision much
research is needed.

During the last few years, several new terms such as Global Computing, Ubiquitous
Computing, Utility Computing, Pervasive Computing, On-demand Computing,
Autonomic Computing, Ambient Intelligence [7], etc., have been coined. In
some cases, these terms describe very similar computational approaches. Consequently,
some people are raising the following question: Are these computational
approaches sides of the same coin? The last part of this talk will explore the
relationship of these approaches with Grid.
Abstract. There is a large variety of Grid test-beds that can be used for experimental
purposes by a small community. However, the number of production Grid
systems that can be used as a service for a large community is very limited. The
current tutorial provides an introduction to three of these very few production Grid
systems. They represent different models and policies of using Grid resources
and hence understanding and comparing them is an extremely useful exercise for
everyone interested in Grid technology.

The Hungarian ClusterGrid infrastructure connects clusters during the nights and
weekends. These clusters are used during the day for educational purposes at
the Hungarian universities and polytechnics. Therefore a unique feature of this
Grid is the switching mechanism by which the day time and night time working
modes are loaded to the computers. In order to manage the system as a production
one, the system is homogeneous: all the machines must install the same Grid
software package.

The second, even larger production Grid system is the LHC-Grid that was developed
by CERN to support the Large Hadron Collider experiments. This Grid is
also homogeneous but it works as a 24-hour service. All the computers in the Grid
are completely devoted to offering Grid services. The LHC-Grid is mainly used by
physicists but in the EGEE project other applications like bio-medical applications
will be ported and supported on this Grid.

The third production Grid is the NorduGrid which is completely heterogeneous
and the resources can join and leave the Grid at any time as they need. The NorduGrid
was developed to serve the Nordic countries of Europe but now more and
more institutions from other countries join this Grid due to its large flexibility.

Concerning the user view an important question is how to handle this large variety
of production Grids and other Grid test-beds. How to develop applications for
such different Grid systems and how to port applications among them? A possible
answer for these important questions is the use of a Grid portal technology. The
EU GridLab project developed the GridSphere portal framework that was the
basis of developing the P-GRADE Grid portal. With the P-GRADE portal users
can develop workflow-like applications including HPC components and can run
such workflows on any of the Grid systems in a transparent way.
Abstract. The CrossGrid project aims to develop new Grid services and tools
for interactive compute- and data-intensive applications. This Tutorial comprises 
presentations and training exercises prepared to familiarize the user with the area 
of Grid computing being researched by the CrossGrid. We present tools aimed at 
both users and Grid application developers. The exercises cover many subjects, 
from user-friendly utilities for handling Grid jobs, through interactive monitoring 
of applications and infrastructure, to data access optimization mechanisms. 



1 Applications and Architecture 

The main objective of this Tutorial is to present the CrossGrid’s achievements in devel- 
opment of tools and grid services for interactive compute- and data-intensive applica- 
tions. The Tutorial presentations and exercises are done from three different perspec- 
tives: 

- user’s perspective, 

- application developer’s perspective, 

- system administrator’s perspective. 

The Tutorial starts with a presentation of demos of the CrossGrid applications which 
are: a simulation and visualization for vascular surgical procedures, a flood crisis team 
decision support system, distributed data analysis in high energy physics and air
pollution combined with weather forecasting [7].

Next, we give an overview of the architecture of the CrossGrid software. The ar- 
chitecture was defined as the result of detailed analysis of requirements from applica- 
tions [4], and in its first form it was recently presented in the overview [2]. During the 
progress of the Project the architecture was refined [5]. The usage of Globus Toolkit 2.x 
was decided at the beginning of the Project for stability reasons and because of close 
collaboration with the DataGrid project. Nevertheless, the software is developed in such 
a way that it will be easy to use in future Grid systems based on OGSA standards. The 
tools and services of the CrossGrid are complementary to those of DataGrid, GridLab 
(with which CG has a close collaboration) and US GrADS [8].
2 Tool Environment

The tools developed in CrossGrid aim to provide new solutions for Grid application 
users and developers. 

The Migrating Desktop, together with its backend, the Roaming Access Server [11],
provides an integrated user frontend to the Grid environment. It is extensible by means
of application and tool plugins. The various CrossGrid applications and tools are
integrated with Migrating Desktop by means of plugins that are designed to act similarly to
those used in popular browsers. The Portal, which provides a graphical user interface,
is, in comparison with the Migrating Desktop, the lightweight client to CrossGrid
resources. It requires only a Web browser and offers simpler functionality for job
submission and monitoring.

The MARMOT [12] MPI verification tool not only checks the strict compliance of
program code with the MPI standard, but also helps locate deadlocks and
other anomalous situations in the running program. Both the C and Fortran language
bindings of MPI standard 1.2 are supported.
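
As an illustration of what locating a deadlock can involve, here is a minimal Python sketch. It is not MARMOT's algorithm, which is not described here; it assumes a simplified model in which each blocked rank waits on exactly one other rank, so a deadlock shows up as a cycle in the wait-for relation. The function name and the input format are hypothetical.

```python
def find_deadlock(waits_for):
    """Return a cycle of ranks as a list, or None if there is no cycle.

    waits_for maps each blocked rank to the single rank it is waiting on
    (a simplifying assumption made for this sketch).
    """
    for start in waits_for:
        path = []
        seen = set()
        rank = start
        # Follow the chain of blocked ranks until it ends or repeats.
        while rank in waits_for and rank not in seen:
            seen.add(rank)
            path.append(rank)
            rank = waits_for[rank]
        if rank in seen:
            # The chain came back to a rank already on the path: a cycle.
            return path[path.index(rank):]
    return None


# Ranks 0 and 1 each block in a receive from the other: a classic deadlock.
cycle = find_deadlock({0: 1, 1: 0})

# Rank 0 waits on 1, rank 1 waits on 2, and rank 2 is not blocked: no deadlock.
no_cycle = find_deadlock({0: 1, 1: 2})
```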

The GridBench tool [16] is an implementation of a set of Grid benchmarks. Such 
benchmarks are important both for site administrators and for the application developers 
and users, who require an overview of performance of their applications on the Grid 
without actually running them. Benchmarks may provide reference data for the high- 
level analysis of applications as well as parameters for performance prediction. 

The PPC performance prediction tool [3] tackles the difficult problem of predicting
performance in the Grid environment. Through the analysis of application kernels
it is possible to derive certain analytical models and use them later for predictions. This
task is automated by PPC.

The G-PM, Performance Measurement tool using the OCM-G application monitor- 
ing system [1], can display performance data about the execution of Grid applications. 
The displayed measurements can be defined during the runtime. The G-PM consists of 
three components: a performance measurement component, a component for high level 
analysis and a user interface and visualization component. 

3 Grid Services 

Roaming Access Server (RAS) [11] serves the requests from the Migrating Desktop or 
Portal and forwards them to appropriate Grid services (scheduling, data access, etc.). 
RAS exposes its interface as a Web Service. It uses Scheduler API for job submission 
and GridFTP API for data transfer. 

Grid Visualisation Kernel [15] provides a visualization engine running in the Grid 
environment. It distributes the visualization pipeline and uses various optimization tech- 
niques to improve the performance of the system. 

The CrossGrid scheduling system [10] extends the EDG resource broker through
support for MPICH-G2 applications running on multiple sites. The Scheduler exposes an
interface that allows for job submission and for retrieval of job status and output. The
Scheduler provides an interface to RAS through the Java EDG JSS API, extends DataGrid
code to provide additional functionality, and uses the postprocessing interface. The main
objectives of the postprocessing are to gather monitoring data from the Grid, such as
cluster load and data transfers between clusters, and to build a central monitoring
service that analyzes these data and provides them in a format suitable for the scheduler
to forecast the collected Grid parameters.

The application monitoring system, OCM-G [1], is a unique online monitoring system
for Grid applications with requests and response events generated dynamically and
toggled at runtime. This imposes much less overhead on the application and therefore
can provide more accurate measurements for performance analysis tools such as
G-PM, using the OMIS protocol.

The infrastructure monitoring system, JIMS (JMX-based Infrastructure Monitoring 
System) [13] provides information that cannot be acquired from standard Grid moni- 
toring systems. JIMS can yield information on all cluster nodes and also routers or any 
SNMP devices. 

SANTA-G can provide detailed data on network packets and publishes its data into
DataGrid R-GMA [6]. SANTA-G does not use external components of CrossGrid, because
it is a low layer of the architecture.

The data access optimization is achieved with a component-expert subsystem. It is 
used to select optimal components to handle various file types and storage systems (e.g. 
tape libraries). The access time estimator extends the functionality of the Reptor replica 
manager developed in DataGrid by providing access time for various storage systems 
[14]. 

4 Testbed 

All Tutorial exercises are run on the CrossGrid testbed which is composed of 18 sites in 
9 countries. The current basic middleware is LCG-2. Most sites offer storage capacities 
around 60 GB. The hardware ranges mostly from Pentium III to Pentium Xeon
based systems, with RAM between 256 MB and 2 GB. Many sites offer dual
CPU systems. The operating system at most sites is RedHat 7.3 [9].

Abstract. The collective communication operations of MPI, and in general
MPI operations with non-local semantics, require the processes participating
in the calls to provide consistent parameters, e.g. a unique root
process, matching type signatures and amounts for data to be exchanged,
or the same operator. Under normal use of MPI such exhaustive consistency
checks are typically too expensive to perform and would compromise
optimizations for high performance in the collective routines. However,
confusing and hard-to-find errors (deadlocks, wrong results, or program
crashes) can result from inconsistent calls to collective operations.

We suggest using the MPI profiling interface to provide more extensive
semantic checking of calls to MPI routines with collective (non-local)
semantics. With this, exhaustive semantic checks can be enabled during
application development, and disabled for production runs. We discuss
what can reasonably be checked by such an interface, and mention some
inherent limitations of MPI that hinder a fully portable interface for
semantic checking. The proposed collective semantics verification interface
for the full MPI-2 standard has been implemented for the NEC proprietary
MPI/SX as well as other NEC MPI implementations.



1 Introduction 

To be useful in high-performance parallel computing, the Message Passing In- 
terface (MPI) standard is carefully designed to allow highly efficient implemen- 
tations on “a wide variety of parallel computers” [3, 6]. For this reason, the error 
behavior of MPI is largely unspecified, and the standard prescribes little in terms 
of mandatory error checking and reporting. MPI is also a complex standard with 
many possibilities for making mistakes during application development. Thus 
there is a trade-off for the MPI implementer between high performance and ex- 
haustive error checking. In this note we mention several useful checks for MPI 
functions with collective (non-local) semantics that are typically not performed 
by an MPI implementation, and suggest using the profiling interface of MPI [6, 
Chapter 8] to implement such more exhaustive checks. 

One motivation for our collective semantics verification interface comes from 
the possibly confusing way the collective communication operations are described 
in the MPI standard [6, Chapter 4]. On the one hand, it is required that processes 
participating in a collective communication operation specify the same amount 
of data for processes sending data and the corresponding receivers: “in contrast 
to point-to-point communication, the amount of data sent must exactly match
the amount of data specified by the receive” [6, page 192]. On the other hand, 
the semantics of many of the collective communication operations are explained 
in terms of point-to-point send and receive operations, which do not impose 
this restriction, and many MPI implementations indeed implement the collective 
operations on top of point-to-point communication. Highly optimized collective 
communications will, as do the point-to-point primitives, use several different 
protocols, and for performance reasons the decision on which protocol to use in 
a given case is taken locally: the MPI standard is naturally defined to make this 
possible. For such MPI implementations non-matching data amounts can lead to 
crashes, deadlocks or wrong results. The increased complexity of the “irregular” 
MPI collectives like MPI_Gatherv, MPI_Alltoallw and so on, where different 
amounts of data can be exchanged between different process pairs aggravates 
these problems, and in our experience even expert MPI programmers sometimes 
make mistakes in specifying matching amounts of data among communicating 
pairs of processes. 

Analogous problems can occur with many other MPI functions with collective
(non-local) completion semantics. For instance, the creation of a new communicator
requires all participating processes to supply the same process group in the
call. If not, the resulting communicator will be ill-formed, and likely to give rise
to completely unspecified behavior at some later point in the program execution.

It would be helpful for the user to be able to detect such mistakes, but 
detection obviously requires extra communication among the participating pro- 
cesses. Thus, extensive checking of MPI calls with non-local semantics in general 
imposes overhead that is too high for usage in critical high-performance appli- 
cations, and unnecessary once the code is correct. 

A solution is to provide optional extended checks for consistent arguments to 
routines with collective semantics. This can either be built into the MPI library 
(and controlled via external environment variables), or implemented stand-alone 
and in principle in a portable way by using the MPI profiling interface. For the 
latter solution, which we have adopted for MPI/SX and other NEC MPI imple- 
mentations [1], the user only has to link with the verification interface to enable 
the extensive checks. For large parts of MPI, a verification interface can be
implemented in terms of MPI without access to internals of a specific implementation.
However, certain MPI objects are not “first-class citizens” (the gravest
omission is MPI_Aint, but this is a different story) and cannot be used in
communication operations, e.g. process groups (MPI_Group) and reduction operators
(MPI_Op), or are not immediately accessible, like the communicator associated
with a window or a file. Thus, a truly portable verification interface independent
of a specific MPI implementation would have to duplicate or mimic much of the
work already performed by the underlying MPI implementation, e.g. tracking
creation and destruction of MPI objects [4, 9]. Implementing our interface for a
specific MPI implementation with access to the internals of the implementation
alleviates these problems, and is in fact straightforward. We recommend that other
MPI implementations provide checking interfaces, too.

1.1 Related Work

The fact that MPI is both a widely accepted and complex standard has motivated 
other work on compile- and run-time tools to assist in finding semantic errors 
during application development. We are aware of three such projects, namely
Umpire [9], MARMOT [4], and MPI-CHECK [5]. All three only (partly) cover
MPI-1, whereas our collective semantics verification interface addresses the full
MPI-2 standard. The first two are in certain respects more ambitious in that they
also cover deadlock detection, which in MPI/SX is provided independently by a
suspend/resume mechanism. Umpire, however, is limited to shared-memory
systems. Many of the constant-time checks discussed in this note, which
in our opinion should be done by any good MPI implementation, are performed
by these tools, whereas the extent to which checks of consistent parameters to
irregular collective communication operations are performed is not clear.

The observations in this note are mostly obvious, and other MPI libraries may 
already include checks along the lines suggested here. One example we are aware 
of is ScaMPI by Scali (see www.scali.com) which provides optional checking of 
matching message lengths. 

The checks performed for MPI operations with non-local semantics by our 
verification interface are summarized in Appendix A. 



2 Verifying Communication Buffers 

For argument checking of MPI calls we distinguish between conditions that 
can be checked locally by each process in constant time, conditions that can 
be checked locally but require more than constant time (say, proportional to 
some input argument, eg. size of communicator), and conditions that cannot be 
checked locally but require communication between processes. 

A communication buffer in MPI is specified by a triple consisting of a buffer 
address, a repetition count, and an MPI datatype. 



2.1 Conditions That Can Be Checked Locally in Constant Time 

General conditions on communication buffers include: 

- count must be non-negative.

- datatype must be committed.

- the target communication buffer of a one-sided communication operation
  must lie properly within the target window.

- for receive buffers, datatype must define non-overlapping memory segments.

- for buffers used in MPI-IO, the displacements of the type-map of datatype
  must be monotonically non-decreasing.

- for buffers used in MPI_Accumulate the primitive datatypes of datatype
  must be the same basic datatype.

The count condition is trivial, and probably checked already by most MPI
implementations. For individual communication operations like MPI_Send and
MPI_Put, which take a rank and a communicator or window parameter, it can
readily be checked that rank is in range. For the one-sided communication
operations the MPI_Win_get_group call gives access to the group of processes
over which the window is defined, and from this the maximum rank can easily
be determined.

The datatype conditions are much more challenging (they are listed in roughly
increasing order of complexity), and checking these independently of the underlying
MPI implementation requires the verification interface to mimic all MPI
datatype operations. Internally, datatypes are probably flagged when committed,
thus the first check is again trivial. In MPI/SX datatype constructors record
information on overlaps, type map, and basic datatypes (but do no explicit
flattening), and derived datatypes are flagged accordingly. Thus these more
demanding checks are also done in constant time.

For the one-sided communication calls which must also supply the location 
of data at the target, the MPI standard [3, Page 101] recommends it be checked 
that target location is within range of the target window. In MPI/SX information 
about all target windows is available locally [8], and thus the condition can 
be checked efficiently in constant time. If this information is not present, the 
profiling interface for MPI_Win_create can perform an all-gather operation to
collect the sizes of all windows.

Some collective operations allow the special MPI_IN_PLACE argument instead
of a communication buffer. When allowed only at the root, as in MPI_Reduce,
correct usage can be checked locally; otherwise this check requires communication.

2.2 Conditions That Can Be Checked Locally but Require More Time

Actual memory addresses specified by a communication buffer must all be
proper, e.g. non-NULL. Since MPI permits absolute addresses in derived datatypes,
and allows the buffer argument to be MPI_BOTTOM, traversal of the datatype
is required to check validity of addresses. This kind of check thus requires time
proportional to the amount of data in the communication buffer. In MPI/SX
certain address violations (e.g. NULL addresses) are caught on the fly.

For irregular collectives like MPI_Gatherv that receive data, no overlapping 
segments of memory must be specified by the receiving (root) process. Since the 
receive locations are determined both by the datatype and the displacement 
vector of the call, checking this requires at least traversal of the displacement 
vector. 
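The overlap condition for the root's receive segments can be illustrated with a small sketch. It is written in plain Python rather than against an MPI library, and it makes simplifying assumptions that are not part of the original text: the receive datatype is taken to be contiguous, so that the segment for rank i covers elements displs[i] to displs[i] + recvcounts[i], and the helper name find_overlap is hypothetical. Sorting the segments costs slightly more than a single traversal of the displacement vector, which is accepted here for brevity.

```python
def find_overlap(recvcounts, displs):
    """Return a pair (rank_a, rank_b) whose receive segments overlap, or None.

    Assumption for this sketch: the receive datatype is contiguous, so the
    segment for a rank is the half-open element range
    [displs[rank], displs[rank] + recvcounts[rank]).
    """
    segments = []
    for rank, (count, displ) in enumerate(zip(recvcounts, displs)):
        if count > 0:  # empty segments cannot overlap anything
            segments.append((displ, displ + count, rank))

    # After sorting by start position, any overlap shows up between neighbours.
    segments.sort()
    for (_, end_a, rank_a), (start_b, _, rank_b) in zip(segments, segments[1:]):
        if start_b < end_a:
            return (rank_a, rank_b)
    return None


# Disjoint segments pass; enlarging the first count makes ranks 0 and 1 collide.
ok = find_overlap([2, 2, 2], [0, 2, 4])
bad = find_overlap([3, 2, 2], [0, 2, 4])
```

For derived datatypes with holes, the segment bounds would have to come from the datatype's type map instead of a simple count.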

MPI operations involving process groups impose conditions, for example that one
group argument must be a subgroup of another (MPI_Comm_create), that a list of ranks
must contain no duplicates (MPI_Group_incl), and so on. With a proper
implementation of the group functionality, all these checks can be performed in time
proportional to the size of the groups involved, that is, in time no longer than the time
needed to perform the operation itself, so that an MPI implementation is not
hurt by performing these checks. MPI/SX performs such checks extensively.

Other examples of useful local, but non-constant time checks are found in
the topology functionality of MPI. For instance, checking that the input to the
MPI_Graph_create call is indeed a symmetric graph requires O(n + m) time, n
and m being the number of nodes and edges in the graph.
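A symmetry check of this kind might look as follows. This is a sketch in plain Python, not the MPI/SX implementation: the function name is_symmetric is hypothetical, and the graph is assumed to be given in the index/edges form taken by MPI_Graph_create, where index[i] is the cumulative number of neighbours of nodes 0 to i and edges is the flattened neighbour list. Counting directed arcs in a hash table gives expected O(n + m) time.

```python
from collections import Counter


def is_symmetric(index, edges):
    """Check that every arc (u, v) has a matching arc (v, u).

    index and edges follow the MPI_Graph_create convention: the neighbours
    of node u are edges[index[u-1]:index[u]] (with index[-1] taken as 0).
    """
    n = len(index)
    if n > 0 and index[-1] != len(edges):
        return False  # malformed input: edge list length does not match index

    arcs = Counter()
    start = 0
    for u, end in enumerate(index):
        for v in edges[start:end]:
            if not 0 <= v < n:
                return False  # neighbour rank out of range
            arcs[(u, v)] += 1
        start = end

    # Multiplicities must agree in both directions.
    return all(arcs.get((v, u), 0) == count for (u, v), count in arcs.items())


# Four nodes with neighbour lists 0: 1,3   1: 0   2: 3   3: 0,2
symmetric = is_symmetric([2, 3, 4, 6], [1, 3, 0, 3, 0, 2])
# Same graph with the last neighbour of node 3 changed from 2 to 1.
broken = is_symmetric([2, 3, 4, 6], [1, 3, 0, 3, 0, 1])
```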

2.3 Conditions Requiring Communication 

The semantics of the collective communication operations of MPI require that 
data to be sent strictly matches data to be received (type matching is explained 
in [6, Page 149]). In particular, the size of data sent, which can be computed as
count times the size of datatype, must match the size of data received. This is a
non-local condition, which requires at least one extra (collective) communication 
operation to verify. To perform the stricter test for matching datatype signatures 
of sent and received data, the data type signature must be sent together with the 
data. The size of the signature is, however, proportional to the size of the data 
it describes, so an overhead of at most a factor of 2 would be incurred by this 
kind of check. Gropp [2] addresses the problem of faster signature checking, and 
proposes an elegant, cheap, and relatively safe method based on hash functions. 
At present, our verification interface does not include signature checking. 

An easy to verify condition imposed by some of the collectives is the con- 
sistent use of MPI_IN_PLACE. For instance, MPI_Allreduce and MPI_Allgather 
require that either all or no processes give MPI_IN_PLACE as a buffer argument. 

3 Verifying Other MPI Arguments 

Many MPI operations with collective semantics are rooted in the sense that
a given root process plays a particular role. In such cases all processes must
give the same root as argument. If this is violated by mistake, deadlock (in the
case of collective communication operations like MPI_Bcast) or wrong results
or worse (could be the case with MPI_Comm_spawn) will be the result. Verifying
that all processes in a collective call have given the same root obviously requires
communication.

The collective reduction operations require that if one process uses a built-in 
operation, all processes use the same operation. In this case it is also stipulated 
that the same predefined datatype be used. Verifying this again requires
communication. Note that formally it is not possible to send MPI_Op and MPI_Datatype
handles to another process, but these handles are (in most MPI implementations)
just integers. A portable verification interface would have to translate
these handles into a representation that can be used in MPI communication. 

The MPI_Comm_create call requires that all processes pass identical groups
to the call. Groups cannot be used as objects for communication either, making it
necessary for a portable interface to translate MPI_Group objects into a
representation suitable for communication. With MPI_Group_translate_ranks each
process can extract the order of the processes in the argument group relative to the
group of the communicator passed in the call (extracted with MPI_Comm_group),
and this list (of integers) can be used for verification (namely that the processes
all compute the same list).

MPI-2 functionality like one-sided communication and I/O makes it possible
to control or modify subsequent execution by providing extra information to
window and file creation calls. As above, it is often strictly required that all
processes provide the same MPI_Info argument, and again verifying this requires
communication. Our verification interface internally translates the info strings
into an integer representation suitable for communication.

The use of run-time assertions that certain conditions hold, also applicable to 
one-sided communication and MPI-IO, gives the MPI implementation a handle 
for certain worthwhile optimizations, eg. by saving communication. The consis- 
tency requirements for using assertions must therefore be strictly observed, and 
violations will for MPI implementations which do exploit assertions most likely 
lead to undesired behavior. On the other hand, checking for consistent assertion 
usage requires communication, and is thus contradictory to the very spirit of 
assertions. However, a collective semantics verification interface should at least 
provide the possibility for assertion checking. 

Creation of a virtual topology is a collective operation, and all processes must
(although this is not explicitly said in the standard) provide identical values for the
virtual topology. For graph topologies it is required that all processes specify
the whole communication graph. This is in itself not a scalable construct [7], and
even checking that the input is correct requires O(n + m) time steps for graphs
with n nodes and m edges (this check is performed by MPI/SX).

A final issue is the consistent call of the collective operations themselves. 
When an MPI primitive with collective semantics is called, all processes in the 
communicator must eventually perform the call, or in other words the sequence 
of collective “events” must be the same for all processes in the communicator. 
Deadlock or crash is most likely to occur if this is not observed, e.g. if some
processes call MPI_Barrier while others call MPI_Bcast. In the MPI/SX verification
interface each collective call on a given communicator first verifies that
all calling processes in the communicator perform the same call. Mismatched 
communicators, eg. calling the same collective with different communicators, is 
a deadlock situation that cannot readily be caught with a profiling interface. In 
MPI/SX deadlocks are caught separately by a suspend/resume mechanism. 



4 Performing Checks Requiring Communication 

By transitivity of equality all checks of conditions requiring the processes to 
provide the same value for some argument, eg. root, MPI_IN_PLACE, but also 
that all processes indeed invoke the same collective in a given “event”, can 
be performed by a broadcast from process 0, followed by a local comparison 
to the value sent by the root. To verify matching send and receive sizes one 
extra collective communication operation suffices. For instance, MPI_Gatherv 
and MPI_Scatterv can be verified by the root (after having passed verification 
for consistent root) scattering its vector of receive- or send-sizes to the other
processes. For MPI_Allgatherv a broadcast of the vector from process 0 suffices, 
whereas MPI_Alltoallv and MPI_Alltoallw require an all-to-all communication. 

The overhead for collective semantics verification is 2 to 4 extra collective 
operations, but with small data. 
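The broadcast-and-compare scheme and the extra scatter for MPI_Gatherv can be simulated in a few lines. The sketch below is plain Python with no MPI library involved: each list holds one value per process, the function names verify_same and verify_gatherv_sizes are hypothetical, and the message text is only modelled loosely on the examples in Section 6.

```python
def verify_same(values, what):
    """Simulate a broadcast from process 0 followed by a local comparison.

    values[rank] is the argument that process `rank` passed to the call.
    Returns one diagnostic string per process that disagrees with process 0.
    """
    reference = values[0]  # the value process 0 would broadcast
    return [
        f"VERIFY({rank}): {what} {value} inconsistent with {what} {reference} of 0"
        for rank, value in enumerate(values)
        if value != reference
    ]


def verify_gatherv_sizes(root_recvsizes, sendsizes):
    """Simulate the root scattering its receive sizes for a gather.

    root_recvsizes[rank] is what the root expects from `rank`;
    sendsizes[rank] is what `rank` actually intends to send.
    Returns the ranks whose send size does not match.
    """
    return [
        rank
        for rank, (expected, actual) in enumerate(zip(root_recvsizes, sendsizes))
        if expected != actual
    ]


# Process 2 mistakenly names itself as root; process 1 sends too much data.
root_errors = verify_same([0, 0, 2], "root")
size_errors = verify_gatherv_sizes([4, 4, 4], [4, 8, 4])
```

In a real profiling-interface wrapper, the first function would correspond to one broadcast and the second to one scatter, both on small data.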

5 Controlling the Verification Interface 

The NEC proprietary MPI implementations by default perform almost all the 
local, constant-time checks discussed in Section 2. In Appendix A we summarize 
the verification of all MPI calls with non-local semantics. Errors detected in 
MPI calls both with local and non-local semantics lead to abortion via the error 
handler associated with the communicator (usually MPI_ERRORS_ARE_FATAL). To 
give the possibility to use the interface with other error handlers, the checks 
for MPI calls with non-local semantics (in most cases) ensure that all calling 
processes are informed of the error condition and that all call the error handler. 

Although not implemented on top of MPI, the NEC collective semantics
verification interface appears to the user as just another MPI profiling interface. The
MPI_Pcontrol(level, ...) function is used to control the level of verification.
We define the following levels of checking:

— level = 0: Checking of collective semantics disabled. 

— level = 1: Collective semantic verification enabled, error-handler abortion 
on violation with concise error message. 

— level = 2: Extended diagnostic explanation before abort. 

— level = 3: MPI_Info argument checking enabled. 

— level = 4: MPI assertion checking enabled. 

The other extreme to collective checking discussed here, is to perform no 
checks at all. It would be possible to equip our interface with a “danger” mode 
(level = -1) with no checks whatsoever performed. This might lead to a (very 
small) latency improvement for some MPI calls, but since the overhead of the 
possible constant-time checks is indeed very small, we have decided against this 
possibility. 

6 Examples 

We give some examples of common mistakes in the usage of MPI routines with 
non-local semantics, and show the output produced by the checking interface 
(level >1). 

Example 1. In a gather-scatter application the root process mistakenly calls
MPI_Gather whereas the other processes call MPI_Scatter:

> mpirun -np 3 example1

VERIFY MPI_SCATTER(2) : call inconsistent with call to MPI_Gather by 0 

after which execution aborts. 







Example 2. In a gather application in which process 0 is the intended root some 
processes mistakenly specified the last process as root: 

> mpirun -np 3 example2 

VERIFY MPI_GATHER(2) : root 2 inconsistent with root 0 of 0 

Example 3. A classical mistake with an irregular collective: In an MPI_Alltoallv
call each process sets its i-th receive count proportional to i and its i-th send count
proportional to i. This is (of course!) wrong (either send or receive count should
have been proportional to the rank of the process). The verification interface
was motivated by precisely this situation:

> mpirun -np 3 example3 

VERIFY MPI_ALLTOALLV(0) : sendsize[1]=4 > expected recvsize(1)[0]=0
VERIFY MPI_ALLTOALLV(2) : sendsize[0]=0 < expected recvsize(0)[2]=8
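
The mismatch in this example can be reproduced with a small simulation. The sketch is plain Python without MPI; the function name alltoallv_mismatches and the 4-byte unit are assumptions chosen to agree with the sizes in the output above. It lists every mismatching pair, whereas the real interface may abort after the first report on each process.

```python
def alltoallv_mismatches(sendsizes, recvsizes):
    """Compare send and receive sizes for every ordered process pair.

    sendsizes[i][j] is what process i sends to j; recvsizes[j][i] is what
    process j expects from i. The nested list `expected` plays the role of
    the extra all-to-all exchange of receive sizes.
    """
    p = len(sendsizes)
    expected = [[recvsizes[j][i] for j in range(p)] for i in range(p)]
    return [
        (i, j, sendsizes[i][j], expected[i][j])
        for i in range(p)
        for j in range(p)
        if sendsizes[i][j] != expected[i][j]
    ]


p, unit = 3, 4

# The mistake: both counts are proportional to the index of the partner.
wrong_send = [[unit * j for j in range(p)] for _ in range(p)]
wrong_recv = [[unit * j for j in range(p)] for _ in range(p)]
mismatches = alltoallv_mismatches(wrong_send, wrong_recv)

# One possible fix: the send count is proportional to the sender's own rank.
fixed_send = [[unit * i for _ in range(p)] for i in range(p)]
fixed = alltoallv_mismatches(fixed_send, wrong_recv)
```

The tuples (0, 1, 4, 0) and (2, 0, 0, 8) in mismatches correspond to the two reported lines: process 0 sends 4 bytes to process 1, which expects 0, and process 2 sends 0 bytes to process 0, which expects 8.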

Example 4. For MPI_Reduce_scatter over intercommunicators, the MPI standard
requires that the amount of data scattered by each process group matches
the amount of data reduced by the other group. In a test application each of the 
local groups contributed an amount of data proportional to the size of the group, 
and scattered these uniformly over the other group. This could be typical - and 
is wrong if the two groups are not of the same size. The verification interface 
detects this: 

> mpirun -np 5 example4

VERIFY MPI_REDUCE_SCATTER(INTERCOMM) : scattersize=2 != reducesize=3

Example 5. In MPI-IO a fileview is used to individually mask out parts of a
file for all file accesses performed by a process. The mask (filetype), and also
the elementary datatype (etype), are represented by an MPI datatype. However,
they must comply with a number of constraints, e.g. that displacements must be
monotonically non-decreasing, that elements of the datatype must not overlap if
the file is opened with write-access, that the filetype is derived from (or identical
to) the etype, that the extent of possible holes in the filetype is a multiple
of the etype’s extent, and that no I/O operations are pending. These are all
constant-time, local checks (see Section 2.1) that are nevertheless not always performed
by MPI implementations, but are caught by the verification interface (in MPI/SX
they are also caught in normal operation):

> mpirun -np 4 fview_leaves 

VERIFY MPI_File_set_view(0) : Filetype is not derived from etype 

Example 6. File access with shared filepointers requires that all processes (that 
have opened the file collectively) have identical fileviews in place. The verification 
interface checks this condition on each access, both collective and non-collective. 
In the implementation for MPI/SX the overhead for this check is small as the 
NEC MPI-IO implementation uses listless I/O [10] which performs caching of 
remote fileviews. 

> mpirun -np 2 shared_fview 

VERIFY MPI_FILE_WRITE_SHARED(1) : local fileview differs from fileview of (0) 

VERIFY MPI_FILE_WRITE_SHARED(0) : local fileview differs from fileview of (1) 

7 Conclusion

We summarized the extensive consistency and correctness checks that can be 
performed with the collective semantics verification interface provided with the 
NEC MPI implementations. We believe that such a library can be useful for 
application programmers in early stages of the development process. Partly be- 
cause of limitations of MPI (many MPI objects are not “first class citizens” and 
the functions to query them are sometimes limited), which makes the necessary 
access to internals difficult or not possible, the interface described here is not 
portable to other MPI implementations. As witnessed by other projects [4,5, 
9] implementing a verification interface in a completely portable manner using 
only MPI calls is very tedious. For every concrete MPI implementation, given 
access to information just below the surface, a verifier library as discussed here 
can be implemented with a modest effort. 

A Verification Performed for Collective MPI Functions

MPI function                          Non-local semantic checks
------------------------------------  ---------------------------------------------------------
MPI_Barrier                           Call consistency
MPI_Bcast                             Call consistency, datasize match
MPI_Gather                            Call, root consistency, datasize match
MPI_Gatherv                           Call, root consistency, datasize match
MPI_Scatter                           Call, root consistency, datasize match
MPI_Scatterv                          Call, root consistency, datasize match
MPI_Allgather                         Call, MPI_IN_PLACE consistency, datasize match
MPI_Allgatherv                        Call, MPI_IN_PLACE consistency, datasize match
MPI_Alltoall                          Call consistency, datasize match
MPI_Alltoallv                         Call consistency, datasize match
MPI_Alltoallw                         Call consistency, datasize match
MPI_Reduce                            Call, root, op consistency, datasize match
MPI_Allreduce                         Call, MPI_IN_PLACE, op consistency, datasize match
MPI_Reduce_scatter                    Call, MPI_IN_PLACE, op consistency, datasize match
MPI_Scan                              Call, op consistency, datasize match
MPI_Exscan                            Call, op consistency, datasize match
MPI_Comm_dup                          Call consistency
MPI_Comm_create                       Call consistency, group consistency
MPI_Comm_split                        Call consistency
MPI_Intercomm_merge                   Call, high/low consistency
MPI_Intercomm_create                  Call, local leader, tag consistency
MPI_Cart_create                       Call consistency, dims consistency
MPI_Cart_map                          Call consistency, dims consistency
MPI_Graph_create                      Call consistency, graph consistency
MPI_Graph_map                         Call consistency, graph consistency
MPI_Comm_spawn{_multiple}             Call, root consistency
MPI_Comm_accept                       Call, root consistency
MPI_Comm_connect                      Call, root consistency
MPI_Comm_disconnect                   Call consistency
MPI_Win_create                        Info consistency
MPI_Win_fence                         Assertion consistency (MPI_MODE_NOPRECEDE, MPI_MODE_NOSUCCEED)
MPI_File_open                         Call, info consistency
MPI_File_set_view                     Call, info consistency, pending operations, type correctness
MPI_File_set_size                     Call consistency, pending operations
MPI_File_preallocate                  Call consistency, pending operations
MPI_File_set_atomicity                Call consistency
MPI_File_seek_shared                  Call consistency, offset & mode match
MPI_File_{write|read}_shared          Fileview consistency
MPI_File_{write|read}_ordered         Call consistency, fileview consistency
MPI_File_{write|read}_ordered_begin   Call consistency, pending operations, fileview consistency
MPI_File_{write|read}_ordered_end     Call consistency, operation match
MPI_File_{write|read}{_at}_all        Call consistency
MPI_File_{write|read}{_at}_all_begin  Call consistency, pending operations
MPI_File_{write|read}{_at}_all_end    Call consistency, operation match

Abstract. Recent works try to optimise collective communication in
grid systems, focusing mostly on the optimisation of communications
among different clusters. We believe that intra-cluster collective
communications should also be optimised, as a way to improve the overall
efficiency and to allow the construction of multi-level collective
operations. Indeed, inside homogeneous clusters, a simple optimisation
approach relies on the comparison of different implementation strategies
through their communication models. In this paper we evaluate this
approach, comparing different implementation strategies with their
predicted performances. As a result, we are able to choose the communication
strategy that best adapts to each network environment.



1 Introduction 

The optimisation of collective communications in grids is a complex task because
the inherent heterogeneity of the network forbids the use of general solutions.
However, the optimisation cost can be considerably reduced if we consider grids
as interconnected islands of homogeneous clusters, provided that we can identify
the network topology.

Most systems only separate inter- and intra-cluster communications, optimising
communication across wide-area networks, which are usually slower than
communication inside LANs. Some examples of this “two-layered” approach include
ECO [11], MagPIe [4, 6] and even LAM-MPI 7 [8]. While ECO and MagPIe apply this
concept to wide-area networks, LAM-MPI 7 applies it to SMP clusters, where each
SMP machine is an island of fast communication. Even so, there is no real
restriction on the number of layers and, indeed, the performance of collective
communications can still be improved by the use of multi-level communication
layers, as observed by [3].

Although most works today use the “islands of clusters” approach, to our
knowledge none of them tries to optimise the intra-cluster communication. We
believe that while inter-cluster communication represents the most important
aspect in grid-like environments, intra-cluster optimisation should also be
considered, especially if the clusters are to be structured in multiple
layers [3]. In fact, collective communications in local-area networks can still
be improved with the use of message segmentation [1,6] or the use of different
communication strategies [12].

In this paper we propose the use of well-known techniques for collective
communication which, due to the relative homogeneity inside each cluster, may
reduce the optimisation cost. In contrast to [13], we decided to model the
performance of different implementation strategies for collective
communications and to select, according to the network characteristics, the
best-suited implementation technique for each set of parameters (communication
pattern, message size, number of processes). We illustrate our approach with
two examples, the Broadcast and Scatter operations, and we validate it by
comparing the performance of real communications with the models’ predictions.

The rest of this paper is organised as follows: Section 2 presents the
definitions and the test environment considered throughout this paper. Section 3
presents the communication models we developed for the most usual
implementations of Broadcast and Scatter. In Section 4 we compare the
predictions from the models with experimental results. Finally, Section 5
presents our conclusions, as well as the future directions of the research.

2 System Model and Definitions 

In this paper we model collective communications using the parameterised LogP
model, or simply pLogP [6]. Throughout this paper we use the terminology of
pLogP’s definition: g(m) is the gap of a message of size m, L is the
communication latency between two nodes, and P is the number of nodes. In the
case of message segmentation, the segment size s of the message m is a multiple
of the size of the basic datatype to be transmitted, and it splits the initial
message m into k segments. Thus, g(s) represents the gap of a segment of size s.

The pLogP parameters used to feed our models were previously obtained with
the MPI LogP Benchmark tool [5] using LAM-MPI 6.5.9 [7]. The experiments to
obtain the pLogP parameters, as well as the practical experiments, were
conducted on the ID/HP icluster-1 from the ID laboratory Cluster Computing
Centre¹, with 50 Pentium III machines (850 MHz, 256 MB) interconnected by a
switched 100 Mbps Ethernet network.

3 Communication Models with pLogP 

Due to limited space, we cannot present models for all collective
communications; we therefore chose to present the Broadcast and the Scatter
operations. Although they are two of the simplest collective communication
patterns, practical implementations of MPI usually construct other collective
operations, for example Barrier, Reduce and Gather, in a very similar way, which
makes these two operations a good test of our models’ accuracy. Further, the
optimisation of grid-aware collective communications makes intensive use of such
communication patterns; for example, the AllGather operation in MagPIe has three
steps: a Gather operation inside each cluster, an AllGatherv among the clusters’
roots and a Broadcast to each cluster’s members.

¹ http://www-id.imag.fr/Grappes/

3.1 Broadcast 

With Broadcast, a single process, called the root, sends the same message of
size m to all other (P - 1) processes. Among the classical implementations of
broadcast in homogeneous environments we find flat, binary and binomial trees,
as well as chains (or pipelines). It is usual to apply different strategies
within these techniques according to the message size, for example the use of a
rendezvous message that prepares the receiver for an incoming large message, or
the use of non-blocking primitives to improve communication overlap. Based on
the models proposed by [6], we developed communication models for some current
techniques and their “flavours”, which are presented in Table 1.

We also considered message segmentation [6,12], which may improve the
communication performance in some specific situations. An important aspect of
message segmentation is determining the optimal segment size. Segments that are
too small pay more for their headers than for their content, while segments
that are too large do not exploit the network bandwidth sufficiently. Hence, we
can use the communication models presented in Table 1 to search for the segment
size s that minimises the communication time in a given network. Once this
segment size s is determined, large messages can be split into
$\lceil m/s \rceil$ segments, while smaller messages are transmitted without
segmentation.

As most of these variations are clearly expensive, we did not consider them
in the experiments of Section 4, and focused only on the comparison of the
most efficient techniques, the Binomial and the Segmented Chain Broadcasts.
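
As a rough illustration of the segment-size search described above, the following Python sketch evaluates two of the broadcast models (Binomial Tree and Segmented Chain, as we read them from Table 1) and scans a list of candidate segment sizes. The function names, the linear `gap` function and every numeric constant are hypothetical stand-ins; in the paper the values of g(m) and L come from pLogP measurements, not from a formula.

```python
import math


def gap(size_bytes, overhead=50e-6, per_byte=0.09e-6):
    """Hypothetical gap g(m) in seconds: fixed per-message cost plus per-byte cost."""
    return overhead + per_byte * size_bytes


def binomial_broadcast(m, P, L, g=gap):
    """Binomial Tree model: floor(log2 P) * g(m) + ceil(log2 P) * L."""
    floor_log = P.bit_length() - 1
    ceil_log = (P - 1).bit_length()
    return floor_log * g(m) + ceil_log * L


def segmented_chain_broadcast(m, s, P, L, g=gap):
    """Segmented Chain model: (P - 1) * (g(s) + L) + g(s) * (k - 1), k = ceil(m / s)."""
    k = math.ceil(m / s)
    return (P - 1) * (g(s) + L) + g(s) * (k - 1)


def best_segment_size(m, P, L, candidates, g=gap):
    """Return the candidate segment size with the lowest predicted completion time."""
    usable = [s for s in candidates if s <= m] or [m]
    return min(usable, key=lambda s: segmented_chain_broadcast(m, s, P, L, g))


if __name__ == "__main__":
    P, L, m = 40, 60e-6, 1 << 20          # illustrative values only
    sizes = [1 << e for e in range(8, 17)]
    s = best_segment_size(m, P, L, sizes)
    print("segment size:", s)
    print("segmented chain:", segmented_chain_broadcast(m, s, P, L))
    print("binomial tree:  ", binomial_broadcast(m, P, L))
```

With a measured gap table in place of `gap`, the same scan would select the segment size for a given cluster; messages no larger than the chosen size would simply be sent unsegmented.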

3.2 Scatter 

The Scatter operation, which is also called “personalised broadcast”, is an
operation where the root holds m × P data items that should be equally
distributed among the P processes, including itself. It is believed that optimal
algorithms for homogeneous networks use flat trees [6], and for this reason the
Flat Tree approach is the default Scatter implementation in most MPI
implementations. The idea behind a Flat Tree Scatter is that, as each node shall
receive a different message, the root sends these messages directly to each
destination node.

To better explore our approach, we constructed the communication models
for other strategies (Table 2) and, in this paper, we compare Flat Scatter and
Binomial Scatter in real experiments. At first glance, a Binomial Scatter is not
as efficient as the Flat Scatter, because each node receives from its parent
node its own message as well as the set of messages it shall send to its
successors. On the other hand, the cost of sending these “combined” messages
(most of which is useless to the receiver and must be forwarded again) may be
compensated
Table 1. Communication Models for Broadcast.

Implementation Technique      Communication Model
Flat Tree                     $(P-1) \times g(m) + L$
Flat Tree Rendezvous          $(P-1) \times g(m) + 2 \times g(1) + 3 \times L$
Segmented Flat Tree           $(P-1) \times (g(s) \times k) + L$
Chain                         $(P-1) \times (g(m) + L)$
Chain Rendezvous              $(P-1) \times (g(m) + 2 \times g(1) + 3 \times L)$
Segmented Chain (Pipeline)    $(P-1) \times (g(s) + L) + (g(s) \times (k-1))$
Binary Tree                   $\le \lceil \log_2 P \rceil \times (2 \times g(m) + L)$
Binomial Tree                 $\lfloor \log_2 P \rfloor \times g(m) + \lceil \log_2 P \rceil \times L$
Binomial Tree Rendezvous      $\lfloor \log_2 P \rfloor \times g(m) + \lceil \log_2 P \rceil \times (2 \times g(1) + 3 \times L)$
Segmented Binomial Tree       $\lfloor \log_2 P \rfloor \times g(s) \times k + \lceil \log_2 P \rceil \times L$

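Table 2. Communication Models for Scatter.

Implementation Technique      Communication Model
Flat Tree                     $(P-1) \times g(m) + L$
Chain                         $\sum_{j=1}^{P-1} g(j \times m) + (P-1) \times L$
Binomial Tree                 $\sum_{j=0}^{\lceil \log_2 P \rceil - 1} g(2^j \times m) + \lceil \log_2 P \rceil \times L$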

by the possibility of executing transmissions in parallel. As the trade-off
between transmission cost and parallel sends is represented in our models, we
can evaluate the advantages of each model according to the clusters’
characteristics.
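
The trade-off between combined messages and parallel sends can be sketched numerically. The Python fragment below compares a Flat Tree Scatter model with a Binomial Tree Scatter model under an assumed reading of the binomial model, in which the root sends combined messages of sizes 2^j × m over ⌈log2 P⌉ steps; this summation, the function names and the linear gap functions are illustrative assumptions, not measured pLogP data.

```python
def flat_scatter(m, P, L, g):
    """Flat Tree Scatter model: (P - 1) * g(m) + L."""
    return (P - 1) * g(m) + L


def binomial_scatter(m, P, L, g):
    """Assumed Binomial Tree Scatter model.

    The root sends combined messages of 2^j * m items over ceil(log2 P)
    steps, and each step adds one latency L.
    """
    steps = (P - 1).bit_length()  # ceil(log2 P) for P >= 1
    return sum(g((2 ** j) * m) for j in range(steps)) + steps * L


if __name__ == "__main__":
    # Hypothetical gap functions (seconds): with and without a fixed per-message cost.
    with_overhead = lambda size: 50e-6 + 0.09e-6 * size
    pure_bandwidth = lambda size: 0.09e-6 * size

    for name, g in (("with overhead", with_overhead), ("pure bandwidth", pure_bandwidth)):
        flat = flat_scatter(1024, 32, 60e-6, g)
        binom = binomial_scatter(1024, 32, 60e-6, g)
        print(name, "flat:", flat, "binomial:", binom)
```

Under these assumptions the binomial variant wins when the fixed per-message cost in the gap dominates, since it issues far fewer sends from the root, and loses when the gap is purely proportional to the message size.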

4 Practical Results 

4.1 Broadcast 

To evaluate the accuracy of our optimisation approach, we measured the
completion time of the Binomial and the Segmented Chain Broadcasts, and we
compared these results with the model predictions. Figs. 1(a) and 1(b) show
that the models’ predictions closely follow the real experiments. Indeed, both
experiments and model predictions show that the Segmented Chain Broadcast is
the strategy best suited to our network parameters, and consequently we can
rely on the models’ predictions to choose the strategy we will apply.

Although the models were accurate enough to select the best-suited strategy,
a close look at Fig. 1 still shows some differences between the models’
predictions and the real results. In the case of the Binomial Broadcast, there
is an unexpected delay when messages are small. In the case of the Segmented
Chain Broadcast, the execution time is slightly larger than expected. We
believe that both variations derive from the same problem.
Fig. 1. Comparison between models and real results: (a) Binomial Tree;
(b) Segmented Chain with 8 kB segments.



Hence, we present in Fig. 2 the comparison of both strategies and their
predictions for a fixed number of machines. The predictions for the Binomial
Broadcast fit the experimental results accurately, except in the case of small
messages (less than 128 kB). Similar discrepancies were already observed by the
LAM-MPI team and, according to [9, 10], they are due to the TCP acknowledgement
policy on Linux, which may delay the transmission of some small messages even
when the TCP_NODELAY socket option is active (only one in every n messages is
delayed, with n varying from one kernel implementation to another).



Fig. 2. Comparison between Chain and Binomial Broadcast (broadcast results for
40 machines; horizontal axis: message size in bytes).



In the case of the Segmented Chain Broadcast, however, this phenomenon affects
all message sizes. Because large messages are split into small segments, such
segments suffer from the same transmission delays as the Binomial Broadcast
with small messages. Further, due to the Chain structure, a delay in one node
is propagated to the end of the chain. Nevertheless, the transmission delay
for a large message (and consequently a large number of segments) does not
increase proportionally, as would be expected, but remains constant.

Because these transmission delays are related to the buffering policy of TCP,
we believe that the first segments that arrive are delayed by the TCP
acknowledgement policy, but the successive arrival of the following segments
forces the transmission of the remaining segments without any delay.



4.2 Scatter 

In the case of Scatter, we compare the experimental results from Flat and
Binomial Scatters with the predictions from their models. Due to our network
characteristics, our experiments showed that a Binomial Scatter can be more
efficient than a Flat Scatter, a fact that is not usually exploited by
traditional MPI implementations. As a Binomial Scatter balances the cost of
combined messages against parallel sends, it may happen, as in our experiments,
that its performance outweighs the “simplicity” of the Flat Scatter, with
considerable gains depending on the message size and number of nodes, as shown
in Figs. 3(a) and 3(b). In fact, the Flat Tree model is limited by the time the
root needs to send successive messages to different nodes (the gap), while the
Binomial Tree Scatter depends mostly on the number of nodes, which defines the
number of communication steps through the $\lceil \log_2 P \rceil \times L$
factor. These results show that the communication models we developed are
accurate enough to identify which implementation is best suited to a specific
environment and set of parameters (message size, number of nodes).




Fig. 3. Comparison between models and real results: (a) Flat Tree Scatter;
(b) Binomial Tree Scatter.



Further, although we can observe some delays related to the TCP acknowledgement
policy on Linux when messages are small, especially in the Flat Scatter, these
variations are less important than those of the Broadcast, as depicted in
Fig. 4.

What caught our attention, however, was the performance of the Flat Tree
Scatter, which outperformed our predictions, while the Binomial Scatter follows
the predictions of its model. We think that the multiple transmissions of the
Flat Scatter become a “bulk transmission”, which forces the communication
buffers to transfer the successive messages all together, somewhat like the
successive sends in the Segmented Chain Broadcast. Hence, the pLogP parameters
measured by the pLogP benchmark tool are not suited to such situations, as the
tool considers only individual transmissions, which mostly correspond to the
Binomial Scatter model.

Fig. 4. Comparison between Flat and Binomial Scatter (scatter results for
40 machines).

This behaviour seems to indicate a relationship between the number of
successive messages sent by a node and the buffer transmission delay, which is
not considered in the pLogP performance model. As this seems a very interesting
aspect for the design of accurate communication models, we shall closely
investigate and formalise this “multi-message” behaviour in future work.



5 Conclusions and Future Work

Existing works that explore optimisation in heterogeneous networks usually
focus only on the optimisation of inter-cluster communication. We do not agree
with this approach, and we suggest optimising both inter-cluster and
intra-cluster communication. Hence, in this paper we described how to improve
the communication efficiency of homogeneous clusters through the use of
well-known implementation strategies.

To compare different implementation strategies, we rely on the modelling
of communication patterns. Our decision to use communication models allows
a fast and accurate performance prediction for the collective communication
strategies, making it possible to choose the technique that best adapts to each
environment. Additionally, because the intra-cluster communication is based on
static techniques, the complexity of generating optimal trees is restricted
to the inter-cluster communication.

Nonetheless, as our decisions rely on network models, their accuracy needs
to be evaluated. Hence, in this paper we presented two examples that compare
the predicted performances and the real results. We showed that the selection
of the best communication implementation can be made with the help of the
communication models. Even though we found some small variations from the
predicted data for small messages, these variations did not compromise the
final decision, and we could identify their probable origin. Hence, one of our
future works is a deep investigation of the factors that lead to such
variations, in particular the relationship between the number of successive
messages and the transmission delay, formalising it and proposing extensions to
the pLogP model.

In parallel, we will evaluate the accuracy of our models with other network
interconnections, especially Gigabit Ethernet and Myrinet, and study how to
reflect the presence of multi-processors and multi-networks (division of
traffic) in our models. Our research will also include the automatic discovery
of the network topology and the construction of optimised inter-cluster trees
that work together with efficient intra-cluster communication.
Abstract. We present improved algorithms for global reduction operations for
message-passing systems. Each of p processors has a vector of m data items, and
we want to compute the element-wise “sum” under a given, associative function
of the p vectors. The result, which is also a vector of m items, is to be
stored at either a given root processor (MPI_Reduce), or all p processors
(MPI_Allreduce). A further constraint is that for each data item and each
processor the result must be computed in the same order, and with the same
bracketing. Both problems can be solved in $O(m + \log_2 p)$ communication and
computation time. Such reduction operations are part of MPI (the Message
Passing Interface), and the algorithms presented here achieve significant
improvements over currently implemented algorithms for the important case where
p is not a power of 2. Our algorithm requires $\lceil \log_2 p \rceil + 1$
rounds (one round off from optimal) for small vectors. For large vectors twice
the number of rounds is needed, but the communication and computation time is
less than $3m\beta$ and $\frac{3}{2}m\gamma$, respectively, an improvement from
$4m\beta$ and $2m\gamma$ achieved by previous algorithms (with the message
transfer time modeled as $\alpha + m\beta$, and reduction-operation execution
time as $m\gamma$). For $p = 3 \times 2^n$ and $p = 9 \times 2^n$ and small
$m < b$ for some threshold b, and $p = q 2^n$ with small q, our algorithm
achieves the optimal $\lceil \log_2 p \rceil$ number of rounds.



1 Introduction and Related Work 

Global reduction operations in three different flavors are included in MPI, the
Message Passing Interface [14]. The MPI_Reduce collective combines element-wise
the input vectors of each of p processes with the result vector stored only at
a given root process. In MPI_Allreduce, all processes receive the result.
Finally, in MPI_Reduce_scatter, the result vector is subdivided into p parts
with given (not necessarily equal) numbers of elements, which are then
scattered over the processes. The global reduction operations are among the
most used MPI collectives. For instance, a 5-year automatic profiling [12, 13]
of all users on a Cray T3E has shown that 37% of the total MPI time was spent
in MPI_Allreduce and that 25% of all execution time was spent in runs that
involved a non-power-of-two number of processes. Thus improvements to these
collectives are almost always worth the effort.

The p processes are numbered consecutively with ranks $i = 0, \ldots, p-1$ (we
use MPI terminology). Each has an input vector $m_i$ of m units. Operations are
binary, associative, and possibly commutative. MPI poses other requirements
that are non-trivial in the presence of rounding errors:

1. For MPI_Allreduce all processes must receive the same result vector; 

2. reduction must be performed in canonical order
$m_0 + m_1 + \cdots + m_{p-1}$ (if the operation is not commutative);

3. the same reduction order and bracketing for all elements of the result vector
is not strictly required, but should be striven for.

For non-commutative operations $a + b$ may be different from $b + a$. In the
presence of rounding errors $a + (b + c)$ may differ from $(a + b) + c$ (two
different bracketings). The requirements ensure consistent results when
performing reductions on vectors of floating-point numbers.
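
Both effects are easy to reproduce. The short Python sketch below is only an illustration of why the order and bracketing requirements matter: it uses ordinary double-precision addition for the rounding case and string concatenation as a stand-in for a non-commutative reduction operator (the values and the `concat` helper are arbitrary choices, not taken from the paper).

```python
# Rounding: the same three doubles summed with two different bracketings.
a, b, c = 0.1, 0.2, 0.3
left_first = (a + b) + c
right_first = a + (b + c)
print(left_first == right_first)   # prints False: the results differ in the last bits


# Non-commutativity: string concatenation is associative but not commutative.
def concat(x, y):
    return x + y


print(concat("m0", "m1") == concat("m1", "m0"))   # prints False
```

If two processes used different bracketings for the same element, they could therefore end up with different floating-point results, which would violate requirement 1 for MPI_Allreduce.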

We consider 1-ported systems, i.e. each process can send and receive a message
at the same time. We assume linear communication and computation costs, i.e.
the time for exchanging a message of m units is $t = \alpha + m\beta$ and the
time for combining two m-vectors is $t = m\gamma$.

Consider first the MPI_Allreduce collective. For $p = 2^n$ (power-of-two),
butterfly-like algorithms that are latency-optimal for small m and bandwidth-
and work-optimal for large m, with a smooth transition from the
latency-dominated to the bandwidth-dominated case as m increases, have been
known for a long time [5, 16]. For small m the number of communication rounds
is $\log_2 p$ (which is optimal [5]; this is what we mean by latency-optimal)
with $m \log_2 p$ elements exchanged/combined per process. For large m the
number of communication rounds doubles because of the required, additional
allgather phase, but the number of elements exchanged/combined per process is
reduced to $2(m - m/p)$ (which is what we mean by bandwidth- and work-optimal).
These algorithms are simple to implement and practical.

When p is not a power-of-two the situation is different. The optimal number of
communication rounds for small m is $\lceil \log_2 p \rceil$, which is achieved
by the algorithms in [3,5]. However, these algorithms assume commutative
reduction operations, and furthermore the processes receive data in different
order, such that requirements 1 and 2 cannot be met. These algorithms are
therefore not suited for MPI. The bandwidth- and work-optimal algorithm for
large m in [5] also suffers from this problem. A repair for (very) small p
would be to collect (allgather) the (parts of the) vectors to be combined on
all processes, using for instance the optimal (and very practical) allgather
algorithm in [6], and then perform the reduction sequentially in the same order
on each process.
Algorithms suitable for MPI (i.e., with respect to the requirements 1 to 3) are
based on the butterfly idea (for large m). The butterfly algorithm is executed
on the largest power-of-two $p' < p$ processes, with an extra communication
round before and after the reduction to cater for the processes in excess of
$p'$. Thus the number of rounds is no longer optimal, and if done naively an
extra $2m$ is added to the amount of data communicated for some of the
processes. Less straightforward implementations of these ideas can be found in
[13,15], which perform well in practice.

The contributions of this paper are to the practically important
non-powers-of-two case. First, we give algorithms with a smooth transition from
the latency-dominated to the bandwidth-dominated case based on a message
threshold of b items. Second, we show that for the general case the amount of
data to be communicated in the extra rounds can be reduced by more than a
factor of 2 from $2m$ to less than m (precisely to $m/2^{n+1}$ if p is
factorized as $p = q 2^n$ with q an odd number). Finally, for certain numbers
of processes $p = q 2^n$ with $q = 3$ and $q = 9$ we give latency- and
bandwidth-optimal algorithms by combining the butterfly idea with a ring
algorithm over small rings of 3 processes; in practice these ideas may also
yield good results for $q = 5, 7, \ldots$, but this is system dependent and
must be determined experimentally.

The results carry over to MPI_Reduce with similar improvements for the non- 
power-of-two case. In this paper we focus on MPI_Allreduce. 

Other related work on reduction-to-all can be found in [1-3]. Collective
algorithms for wide-area clusters are developed in [7-9]; further protocol
tuning can be found in [4,10,15], especially on shared memory systems in [11].
Compared to [15], the algorithms of this paper furthermore give a smooth
transition from latency to bandwidth optimization and higher bandwidth and
shorter latency if the number of processes is not a power-of-two.



2 Allreduce for Powers-of-Two 

Our algorithms consist of two phases. In the reduction phase reduction is
performed with the result scattered over subsets of the p processors. In the
routing phase, which is only necessary if m is larger than a threshold b, the
result vector is computed by gathering the partial results over each subset of
processes. Essentially, only a different routing phase is needed to adapt the
algorithm to MPI_Reduce or MPI_Reduce_scatter. For MPI_Allreduce and MPI_Reduce
the routing phase is most easily implemented by reversing the communication
pattern of the reduction phase.

It is helpful first to recall briefly the hybrid butterfly algorithm as found in
e.g. [16]. For now $p = 2^n$ and a message threshold b is given.

In the reduction phase a number of communication and computation rounds
is performed. Prior to round $z$, $z = 0, \ldots, n-1$, with $m/2^z > b$, each
process i possesses a vector of size $m/2^z$ containing a block of the partial
result $((m_{i_0} + m_{i_0+1}) + \cdots + (m_{i_0+2^z-2} + m_{i_0+2^z-1}))$,
where $i_0$ is obtained by setting the least significant z bits of i to 0. In
round z process i sends half of its partial result
Process i:

    // Reduction phase
    m' <- m                                  // current data size
    d <- 1                                   // “distance”
    while d < p do
        // round
        select r - 1 neighbors of i          // Protocol decision
        if m' > b then
            exchange(m'/r) with r - 1 neighbors
            push neighbors and data sizes on stack
            m' <- m'/r
        else
            exchange(m') with r - 1 neighbors
        local reduction of r (partial) vectors of size m'
        d <- d × r
    end while
    // Routing phase
    while stack non-empty do
        pop neighbors and problem size off stack
        exchange with neighbors
    end while

Fig. 1. High-level sketch of the (butterfly) reduction algorithm. For p a power
of two a butterfly exchange step is used with r = 2. For other cases different
exchange/elimination steps can be used as explained in Section 3.



to process $i \oplus 2^z$ ($\oplus$ denotes bitwise exclusive-or, so the
operation corresponds to flipping the $z$th bit of i), and receives the other
half of this process’ partial result. Both processes then perform a local
reduction, which establishes the invariant above for round $z + 1$. If
furthermore the processes are careful about the order of the local reduction
(informally, either from left to right or from right to left), it can be
maintained that the partial results on all processes in a group have been
computed in canonical order, with the same bracketing, such that the
requirements 1 to 3 are fulfilled. If in round z the size of the result vector
is $m/2^z < b$, halving is not performed, in which case processes i and
$i \oplus 2^z$ will end up with the same partial result for the next and all
succeeding rounds. For the routing phase, nothing needs to be done for these
rounds, whereas for the preceding rounds where halving was done, the blocks
must be combined.

A high-level sketch of the algorithm is given in Figure 1. In Figure 2 and
Figure 3 the execution is illustrated for p = 8. The longer boxes show the
process groups for each round. The input buffer is divided into 8 segments
$A$-$H_i$ on process i. The figure shows the buffer data after each round:
$X$-$Y_{i-j}$ is the result of the reduction of the segments X to Y from
processes i to j.

Following this sketch it is easy to see that the reduction phase as claimed
takes $n = \log_2 p$ rounds. For $m/p > b$ the amount of data sent and received
per process is $\sum_{k=0}^{n-1} m/2^{k+1} = m(1 - 1/p)$. For $m < b$ the
routing phase is empty so the optimal $\log_2 p$ rounds suffice. For $m > b$
some allgather rounds are necessary, namely one for each reduction round in
which $m/2^k > b$. At most, the number of rounds doubles.
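
The round count and the ordering argument can be checked with a small simulation. The Python sketch below is a minimal model of the full-exchange butterfly (the small-m case, without halving) on p = 2^n simulated processes; it is not the authors' implementation, and the names `butterfly_allreduce` and `halving_volume` are hypothetical. Each pair combines the lower-ranked partial result on the left, which is one way of being "careful about the order".

```python
def butterfly_allreduce(inputs, op):
    """Simulate the full-exchange butterfly for p = 2^n processes.

    inputs[i] is the value held by process i; op is the (possibly
    non-commutative) reduction operator. Returns (results, rounds).
    """
    p = len(inputs)
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    partial = list(inputs)
    d, rounds = 1, 0
    while d < p:
        nxt = []
        for i in range(p):
            partner = i ^ d                      # flip one bit of the rank
            lo, hi = min(i, partner), max(i, partner)
            nxt.append(op(partial[lo], partial[hi]))  # lower rank always on the left
        partial = nxt
        d *= 2
        rounds += 1
    return partial, rounds


def halving_volume(m, p):
    """Data sent per process in the halving reduction phase: sum of m / 2^(k+1)."""
    n = p.bit_length() - 1
    return sum(m / 2 ** (k + 1) for k in range(n))


if __name__ == "__main__":
    bracket = lambda x, y: "(" + x + "+" + y + ")"
    results, rounds = butterfly_allreduce(["m%d" % i for i in range(8)], bracket)
    print(rounds, results[0])
    print(halving_volume(1024, 8), 1024 * (1 - 1 / 8))
```

Using a bracketing string operator makes the evaluation order visible: every simulated process ends with the same canonically ordered expression after log2 p rounds, and the halving volume matches m(1 - 1/p).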
[Figure: message exchange pattern of the butterfly algorithm for doubling
levels z = 0, 1, 2 (number of pairs in each block = 2^z), shown for the basic
and the mixed protocol. Types of protocol: A = Allreduce, R = Reduce-scatter,
G = Allgather, N = No-op. Basic protocol entities: Allreduce step with full
buffer exchange (in the first phase) and No-op step (in the second phase);
Reduce-scatter step with buffer halving (in the first phase) and Allgather step
with buffer doubling (in the second phase).]

Fig. 2. The butterfly reduction algorithm.

Fig. 3. Intermediate results after each protocol step when the threshold b is
reached in round 3.

3 The Improvements: Odd Number of Processes 

We now present our improvements to the butterfly scheme when p is not a
power-of-two. Let for now $p = q 2^n$ where $2^n$ is the largest power of two
smaller than p (and q is odd).

For the general case we introduce a more communication-efficient way to
include data from processes in excess of $2^n$ into the butterfly algorithm. We
call this step 3-2 elimination. Based on this we give two different algorithms
for the general case, both achieving the same bounds. For certain small values
of q we show that a ring-based algorithm can be used in some rounds of the
butterfly algorithm, and for certain values of q this results in the optimal
number of communication rounds.

3.1 The 3-2 Elimination Step 

For $m' > b$ the 3-2 elimination step is used on a group of three processes
$p_0$, $p_1$, and $p_2$, to absorb the vector of process $p_2$ into the partial
results of processes $p_0$ and $p_1$, which will survive for the following
rounds. The step is as follows: process $p_2$ sends $m'/2$ (upper) elements to
process $p_1$, and simultaneously receives $m'/2$ (lower) elements from process
$p_1$. Processes $p_1$ and $p_2$ can then perform the reduction operation on
their respective part of the vector. Next, process $p_0$ receives the $m'/2$
(lower) elements of the partial result just computed from process $p_2$, and
sends $m'/2$ (upper) elements to process $p_1$. Processes $p_0$ and $p_1$
compute a new partial result from the $m'/2$ elements received.

[Figure: protocol diagrams labelled "Bandwidth-optimized", "Latency optimized"
and "Less than power-of-two", with the protocol entities "3-2 (triple)
elimination step with buffer halving (R) and corresponding triple allgather
step" and "3-2 (triple) elimination step with full buffer exchange (A) and
corresponding triple no-op step".]

Fig. 4. Overlapping elimination protocol for p = 15 and p = 13 using
3-2-elimination steps.

As can be seen process po and p\ can finish after two rounds, both with the 
half of the elements of the result vector [m' Q + (m\ + m' 2 )\. The total time for 
process po and pi is 2a: + /3m' + jm'. 

Compare this to the trivial solution based on 2-1 elimination. First, process
$p_2$ sends all its $m'$ elements to process $p_1$ (2-1 elimination), after
which processes $p_0$ and $p_1$ perform a butterfly exchange of $m'/2$
elements. The time for this solution is
$2\alpha + \tfrac{3}{2}\beta m' + \tfrac{3}{2}\gamma m'$.

[Figure 5 panel: 2-1 (double) elimination step without buffer halving and the
corresponding gather step.]

Fig. 5. Non-overlapping elimination protocol for $p = 15$ and $p = 13$ using
3-2- and 2-1-elimination steps.

Fig. 6. Plugging in any algorithm for an odd number (here 3) of processes after
reducing $p$ with the butterfly algorithm (here with two steps) to its odd
factor.
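
The two cost expressions for absorbing a third process (the 3-2 elimination
step versus a 2-1 elimination followed by a butterfly exchange) can be compared
numerically. The following is a minimal sketch in C under the linear cost model
of the text; the type and function names are illustrative assumptions and do
not come from the paper.

```c
/* Linear cost model: alpha = per-message latency, beta = transfer time
 * per element, gamma = reduction time per element.
 * The struct and function names are hypothetical, chosen for illustration. */
typedef struct {
    double alpha;
    double beta;
    double gamma;
} cost_model;

/* 3-2 elimination step: two rounds, each transferring and reducing
 * m/2 elements, i.e. 2*alpha + beta*m + gamma*m. */
double cost_3_2_step(cost_model c, double m)
{
    return 2.0 * c.alpha + c.beta * m + c.gamma * m;
}

/* 2-1 elimination (m elements) followed by a butterfly exchange
 * (m/2 elements), i.e. 2*alpha + 1.5*beta*m + 1.5*gamma*m. */
double cost_2_1_then_butterfly(cost_model c, double m)
{
    return 2.0 * c.alpha + 1.5 * c.beta * m + 1.5 * c.gamma * m;
}
```

Both variants pay the same latency term; the difference is entirely in the
$\beta$ and $\gamma$ terms, so the advantage of the 3-2 step grows linearly
with the vector length.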



For $m' < b$, the full buffers are exchanged and reduced; see the protocol
entities described in Fig. 4.

The 3-2 elimination step can be plugged into the general algorithm of
Figure 1. For $p = 2^n + Q$ with $Q < 2^n$, the total number of elimination
steps to be performed is $Q$. The problem is to schedule these in the butterfly
algorithm in such a way that the total number of rounds does not increase by
more than 1, for a total of $n + 1 = \lceil \log_2 p \rceil$ rounds.
Interestingly, we have found two solutions to this problem, which are
illustrated in the next subsections.

3.2 Overlapping 3-2 Elimination Protocol 

Figure 4 shows the protocol examples with 15 and 13 processes. In general, this
protocol schedules 3-2-elimination steps on a group of $2^z \times 3$ processes
in each round $z$ for which the $z$th bit of $p$ is 1. The 3-2-steps exchange
two messages of the same size and are therefore drawn with double width. The
first process is not involved in the first message exchange, therefore this
part is omitted from the shape in the figure. After each 3-2-step, the third
process is eliminated, which is marked with dashed lines in the following
rounds. The number of independent pairs or triples in each box is $2^z$. As can
be seen, the protocol does not introduce delays where some processes have to
wait for other processes to complete their 3-2 elimination steps of previous
rounds, but different groups of processes can simultaneously be at different
rounds. Note that this protocol can be used in general for any number of
processes. If $p$ includes a factor $2^n$, then it starts with $n$ butterfly
steps.
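
The schedule of the overlapping protocol can be read off the binary
representation of $p$. The sketch below assumes that round $z$ processes $2^z$
triples whenever bit $z$ of $p$ is set, and checks that the eliminated
processes add up to $Q = p - 2^n$. The helper names are hypothetical and the
code models only the counting, not the communication.

```c
#include <stddef.h>

/* Largest n with 2^n <= p, for p >= 1. */
static int floor_log2(unsigned p)
{
    int n = 0;
    while (p >>= 1)
        n++;
    return n;
}

/* Fills triples[z] with the number of 3-2 steps in round z (0 <= z < n):
 * 2^z if bit z of p is set, otherwise 0. Returns n = floor(log2 p).
 * At most `capacity` entries are written (hypothetical helper). */
int elimination_schedule(unsigned p, unsigned triples[], size_t capacity)
{
    int n = floor_log2(p);
    for (int z = 0; z < n && (size_t)z < capacity; z++)
        triples[z] = ((p >> z) & 1u) ? (1u << z) : 0u;
    return n;
}

/* Each 3-2 step eliminates one process, so the sum over all rounds
 * is the number of eliminated processes, expected to be p - 2^n. */
unsigned total_eliminated(unsigned p)
{
    unsigned t[32] = {0};
    int n = elimination_schedule(p, t, 32);
    unsigned sum = 0;
    for (int z = 0; z < n; z++)
        sum += t[z];
    return sum;
}
```

For $p = 13$ (binary 1101) this yields one triple in round 0, none in round 1,
and four triples in round 2, eliminating $1 + 4 = 5 = 13 - 8$ processes.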

3.3 Non-overlapping Elimination Protocol

Figure 5 shows a different protocol that eliminates all excess processes at
round $z = 1$. With the combination of one 3-2-elimination step and pairs of
2-1-elimination steps, any odd number of processes $p$ is thus reduced to its
next smaller power-of-two value. Note that for $m > b$ in round $z = 1$ only
$m/2$ data are sent in the 2-1-elimination step (instead of $m$ if the 2-1
elimination had been performed prior to round $z = 0$).

Both the overlapping and the non-overlapping protocol exchange the same amount
of data and the same number of messages. For small $m < b$ the total time is
$t = (1 + \lceil \log_2 p \rceil)\alpha + m(1 + \lceil \log_2 p \rceil)\beta
+ m \lceil \log_2 p \rceil \gamma$, where the extra round (the $\alpha$-term)
stems from the need to send the final result to the eliminated processes. For
large $m > b$ the total time is
$t = 2\lceil \log_2 p \rceil \alpha + 2m(1.5 - 1/p')\beta + m(1.5 - 1/p')\gamma$
with $p' = 2^n$ being the largest power of two smaller than $p$.

This protocol is designed only for odd numbers of processes. For any other
number of processes it must be combined with the butterfly.

3.4 Small Ring 

Let now $p = r^q 2^n$. The idea here is to handle the reduction step for the
$r^q$ factor by a ring. For $r - 1$ rounds process $i$ receives data from
process $(i - 1) \bmod r$ and sends data to process $(i + 1) \bmod r$. For
$m > b$ each process sends/receives only $m/r$ elements per round, whereas for
$m < b$ each process sends its full input vector along the ring. After the last
step each process sequentially reduces the elements received: requirements 1
and 2 make it necessary to postpone the local reductions until data from all
processes have been received. For $m > b$ each process has $m/r$ elements of
the result vector $m_0 + m_1 + \ldots + m_{r-1}$. We note that the butterfly
exchange step can be viewed as a 2-ring; the ring algorithm is thus a natural
generalization of the butterfly algorithm.

For small $m < b$ and if also $r > 3$, the optimal allgather algorithm of [6]
would actually be much preferable; however, the sequential reduction remains a
bottleneck, and this idea is therefore only attractive for small $p$ (dependent
on the ratio of $\alpha$ and $\beta$ to $\gamma$).

Substituting the ring algorithm for the neighbor exchange step in the algorithm
of Figure 1, we can implement the complete reduction phase in $(r - 1)q + n$
rounds. This gives a theoretical improvement for $r = 3$ and $q = 1, 2$ to the
optimal number of $\lceil \log_2 p \rceil$ rounds. The general algorithm would
require $\lceil \log_2 p \rceil + 1$ rounds, one more than optimal, whereas the
algorithm with ring steps takes 1 round less. Let for example
$p = 12 = 3 \times 2^2$. The ring based algorithm needs $2 + 2 = 4$ rounds,
whereas the general algorithm would take
$\lceil \log_2 12 \rceil + 1 = 4 + 1 = 5$ rounds.
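
The round counts in this comparison are easy to tabulate. The following sketch
evaluates $(r - 1)q + n$ for the ring-based variant and
$\lceil \log_2 p \rceil + 1$ for the general algorithm; the function names are
illustrative assumptions, and the code only counts rounds.

```c
/* ceil(log2(p)) for p >= 1, using integer arithmetic only. */
int ceil_log2(unsigned p)
{
    int rounds = 0;
    unsigned long long v = 1;
    while (v < p) {
        v <<= 1;
        rounds++;
    }
    return rounds;
}

/* Reduction-phase rounds for p = r^q * 2^n when each factor r is
 * handled by a ring (r - 1 rounds) and 2^n by butterfly steps. */
int ring_rounds(int r, int q, int n)
{
    return (r - 1) * q + n;
}

/* Rounds of the general algorithm for non-powers-of-two. */
int general_rounds(unsigned p)
{
    return ceil_log2(p) + 1;
}
```

With $r = 3$ the ring variant matches $\lceil \log_2 p \rceil$ for $q = 1$ and
$q = 2$, but for $q = 3$ ($p = 27$) it needs 6 rounds against
$\lceil \log_2 27 \rceil = 5$, which is why the improvement is stated only for
small $q$.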

3.5 Comparison 

The time needed for latency-optimized (exchange of full buffers) and
bandwidth-optimized (recursive buffer halving or exchange of $1/p$ of the
buffer) protocols are:

Latency-optimized:
  ring:        $t = \alpha(p - 1) + \beta m(p - 1) + \gamma m(p - 1)$
  elimination: $t = \alpha(\lceil \log_2 p \rceil + 1) + \beta m(\lceil \log_2 p \rceil + 1) + \gamma m \lceil \log_2 p \rceil$

Bandwidth-optimized:
  ring:        $t = \alpha\,2(p - 1) + \beta m\,2(1 - 1/p) + \gamma m(1 - 1/p)$
  elimination: $t = \alpha\,2\lceil \log_2 p \rceil + \beta m\,2(1.5 - 1/p') + \gamma m(1.5 - 1/p')$

with $p' = 2^{\lfloor \log_2 p \rfloor}$. Table 1 compares the 4 algorithms for
four cases based on different ratios $\beta m/\alpha$ and $\gamma m/\alpha$,
and for several numbers of processes $p$. The fastest protocol is marked in
each line. Note that this table does not necessarily give the optimal values
for the elimination protocols, because these may be achieved by using some
internal steps with buffer halving and the further steps without buffer
halving. One can see that each algorithm has a usage range where it is
significantly faster than the other protocols.

Table 1. Execution time of the four protocols for odd numbers of processes
($p$) and different message sizes. The time is displayed as multiples of the
message transfer latency $\alpha$. In each line, the fastest protocol is
marked (*).



3.6 Putting the Pieces Together 

The 3-2-elimination step and the ring exchange are two alternative exchange
patterns that can be plugged into the high-level algorithm of Figure 1 for
non-powers-of-two; see also Fig. 6. The number of processes
$p = 2^n q_1 q_2 \cdots q_h$ is factorized into a) $2^n$ for the butterfly
protocol, b) small odd numbers $q_1, \ldots, q_{h-1}$ for the ring protocol,
and c) finally an odd number $q_h$ for the 3-2-elimination or 2-1-elimination
protocol. For given $p$ it is of course essential that each process $i$ at each
round $z$ can determine efficiently (i.e., in constant time) what protocol is
to be used. This amounts to determining a) the exchange step (butterfly,
3-2-elimination, 2-1-elimination, ring), b) the neighboring process(es), and
c) whether the process will be active for the following rounds. We did not give
the details; however, for all protocols outlined in the paper this is indeed
the case, but as a shortcut, Table 1 is now used for the odd factors $q_i$ and
a vector size scaled by $1/2^n$ if the butterfly protocol uses buffer halving
due to long vectors.

4 Conclusion and Open Problems 

We presented an improved algorithm for the MPI_Allreduce collective for the
important case where the number of participating processes ($p$) is not a power
of two, i.e., $p = 2^n q$ with odd $q$ and $n \ge 0$. For general
non-powers-of-two and small vectors, our algorithm requires
$\lceil \log_2 p \rceil + 1$ rounds, one round off from optimal. For large
vectors twice the number of rounds is needed, but the communication and
computation time is less than $(1 + 1/2^{n+1})(2m\beta + m\gamma)$, i.e., an
improvement from $2(2m\beta + m\gamma)$ achieved by previous algorithms [15];
e.g., with $p = 24$ or $40$, the execution time can be reduced by 47%. For
small vectors and small $q$ our algorithm achieves the optimal
$\lceil \log_2 p \rceil$ number of rounds.
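
The quoted 47% can be checked from the two bounds. Both $p = 24 = 2^3 \cdot 3$
and $p = 40 = 2^3 \cdot 5$ have $n = 3$, so the ratio of the new bound to the
previous one is:

```latex
\[
\frac{\left(1 + \frac{1}{2^{n+1}}\right)(2m\beta + m\gamma)}
     {2\,(2m\beta + m\gamma)}
  = \frac{1 + \frac{1}{16}}{2}
  = 0.53125,
\qquad
1 - 0.53125 = 0.46875 \approx 47\%.
\]
```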

The main open problem is whether a latency-optimal allreduce algorithm under
the MPI constraints 1-3 with $\lceil \log_2 p \rceil$ rounds is possible for
any number of processes. We are not aware of results to the contrary.

Abstract. This paper presents a new scheme, Send Gather Receive
Scatter (SGRS), to perform zero-copy datatype communication over In- 
finiBand. This scheme leverages the gather/scatter feature provided by 
InfiniBand channel semantics. It takes advantage of the capability of 
processing non-contiguity on both send and receive sides in the Send 
Gather and Receive Scatter operations. In this paper, we describe the 
design, implementation and evaluation of this new scheme. Compared 
to the existing Multi-W zero-copy datatype scheme, the SGRS scheme 
can overcome the drawbacks of low network utilization and high startup 
costs. Our experimental results show significant improvement in both 
point-to-point and collective datatype communication. The latency of a 
vector datatype can be reduced by up to 62% and the bandwidth can 
be increased by up to 400%. The Alltoall collective benchmark shows a 
performance benefit of up to 23% reduction in latency. 



1 Introduction 

The MPI (Message Passing Interface) Standard [3] has evolved as a de facto 
parallel programming model for distributed memory systems. As one of its most 
important features, MPI provides a powerful and general way of describing arbi- 
trary collections of data in memory in a compact fashion. The MPI standard also 
provides run time support to create and manage such MPI derived datatypes. 
MPI derived datatypes are expected to become a key aid in application devel- 
opment. 

In principle, there are two main goals in providing derived datatypes in MPI.
First, several MPI applications such as (de)composition of multi-dimensional
data volumes [1,4] and finite-element codes [2] often need to exchange data with
algorithm-related layouts between two processes. In the NAS benchmarks such
as MG, LU, BT, and SP, non-contiguous data communication has been found to
be dominant [10]. Second, MPI derived datatypes provide opportunities for MPI
implementations to optimize datatype communication. Therefore, applications
developed with datatypes can achieve portable performance over different MPI
implementations with optimized datatype communication.

In practice, however, the poor performance of many MPI implementations
with derived datatypes [2, 5] becomes a barrier to using derived datatypes. A
programmer often prefers packing and unpacking noncontiguous data manually,
even at considerable effort. Recently, a significant amount of research
has concentrated on improving datatype communication in MPI implementations,
including 1) improved datatype processing systems [5,13], 2) optimized
packing and unpacking procedures [2,5], and 3) taking advantage of network
features to improve noncontiguous data communication [17].

In this paper, we focus on improving non-contiguous data communication
by taking advantage of InfiniBand features, specifically zero-copy datatype
communication over InfiniBand. Zero-copy communication protocols are of
increased importance because they improve memory performance and reduce
host CPU involvement in moving data. Our previous work [17] used
multiple RDMA writes, Multi-W, as an effective solution to achieve zero-copy
datatype communication. In this paper we look at an alternate way of achieving
zero-copy datatype communication using the send/receive semantics with the
gather/scatter feature provided by InfiniBand. We call this scheme SGRS (Send
Gather Receive Scatter) in the rest of this paper. This scheme can overcome
two main drawbacks in the Multi-W scheme: low network utilization and high
startup cost. We have implemented and evaluated our proposed SGRS scheme
in MVAPICH, an MPI implementation over InfiniBand [12,9].

The rest of the paper is organized as follows. We first give a brief overview 
of InfiniBand and MVAPICH in Section 2. Section 3 provides the motivation 
for the SGRS scheme. Section 4 describes the basic approach, the design issues 
involved and the implementation details. The performance results are presented 
in Section 5. Section 6 presents related work. We draw our conclusions and 
possible future work in Section 7. 



2 Background 

In this section we provide an overview of the Send Gather/Recv Scatter feature 
in InfiniBand Architecture and MVAPICH. 

2.1 Send Gather/Recv Scatter in InfiniBand 

The InfiniBand Architecture (IBA) [6] defines a System Area Network (SAN) for
interconnecting processing nodes and I/O nodes. It supports both channel and
memory semantics. In channel semantics, send/receive operations are used for
communication. In memory semantics, RDMA write and RDMA read operations
are used instead. In channel semantics, the sender can gather data from multiple
locations in one operation. Similarly, the receiver can receive data into
multiple locations. In memory semantics, non-contiguity is allowed only on one
side. RDMA write can gather multiple data segments together and write all data
into a contiguous buffer on the remote node in one single operation. RDMA read
can scatter data into multiple local buffers from a contiguous buffer on the
remote node.

2.2 Overview of MVAPICH 

MVAPICH is a high performance implementation of MPI over InfiniBand. Its 
design is based on MPICH [15] and MVICH [8]. The Eager protocol is used to 
transfer small and control messages. The Rendezvous protocol is used to transfer 
large messages. Datatype communication in the current MVAPICH is directly 
derived from MPICH and MVICH without any change. Basically, the generic 
packing and unpacking scheme is used inside the MPI implementation. When 
sending a datatype message, the sender first packs the data into a contiguous 
buffer and follows the contiguous path. On the receiver side, it first receives data 
into a contiguous buffer and then unpacks data into the user buffers. In the rest
of this paper, we refer to this scheme as the Generic scheme.
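
The pack/unpack path of the Generic scheme can be illustrated without any MPI
calls. The sketch below copies the leading columns of a row-major integer array
into a contiguous buffer and back; it is a simplified stand-in for the generic
datatype engine, and the function names and the choice of leading columns are
assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Pack the first `cols` columns of each of `rows` rows from a row-major
 * array whose rows are `stride` ints long into a contiguous buffer.
 * Returns the number of ints packed (hypothetical helper). */
size_t pack_columns(const int *array, size_t rows, size_t stride,
                    size_t cols, int *packed)
{
    for (size_t r = 0; r < rows; r++)
        memcpy(packed + r * cols, array + r * stride, cols * sizeof(int));
    return rows * cols;
}

/* Inverse operation on the receiver: scatter the contiguous buffer
 * back into the strided layout of the user array. */
void unpack_columns(const int *packed, size_t rows, size_t stride,
                    size_t cols, int *array)
{
    for (size_t r = 0; r < rows; r++)
        memcpy(array + r * stride, packed + r * cols, cols * sizeof(int));
}
```

Each transfer therefore costs one extra memory copy on each side, which is the
overhead the zero-copy schemes aim to remove.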

3 Motivating Case Study for the Proposed SGRS Scheme 

Consider a case study involving the transfer of multiple columns in a
two-dimensional M x N integer array from one process to another. There are two
possible zero-copy schemes. The first one uses multiple RDMA writes, one per
row. The second one uses Send Gather/Receive Scatter. We compare these two
schemes over the VAPI layer, which is an InfiniBand API provided by
Mellanox [11]. The first scheme posts a list of RDMA write descriptors. Each
descriptor writes one contiguous block in each row. The second scheme posts
multiple Send Gather descriptors and Receive Scatter descriptors. Each
descriptor has 50 blocks from 50 different rows (50 is the maximum number of
segments supported in one descriptor in the current version of the Mellanox
SDK). We will henceforth refer to these two schemes as "Multi-W" and "SGRS" in
the plots. In the first test, we consider a 64 x 4096 integer array. The number
of columns varies from 8 to 2048. The total message size varies from 2 KBytes
to 512 KBytes accordingly. The bandwidth test is used for evaluation. As shown
in Figure 1, the SGRS scheme consistently outperforms the Multi-W scheme.
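
The message sizes quoted for the first test follow directly from the array
geometry. The sketch below assumes 4-byte integers (an assumption, though it is
the element size that reproduces the stated 2 KBytes and 512 KBytes); the
function names are illustrative.

```c
#include <stddef.h>

/* Total payload when `cols` columns of `rows` rows are transferred;
 * elem_size is the size of one array element in bytes. */
size_t message_bytes(size_t rows, size_t cols, size_t elem_size)
{
    return rows * cols * elem_size;
}

/* Size of one contiguous block: the selected columns within a single row. */
size_t block_bytes(size_t cols, size_t elem_size)
{
    return cols * elem_size;
}
```

For the 64-row array, 8 columns give blocks of 32 bytes and 2048 columns give
blocks of 8 KBytes, which spans the range from very small to fairly large
contiguous pieces.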

In the second test, the number of blocks varies from 4 to 64. The total message
sizes we studied are 128 KBytes, 256 KBytes, and 512 KBytes. Figure 2 shows the
bandwidth results with different numbers of blocks and different message sizes.
When the number of blocks is small, both the Multi-W and SGRS schemes perform
comparably. This is because the block size is relatively large, so the network
utilization in the Multi-W scheme is still high. As the number of segments
increases, we observe a significant fall in bandwidth for the Multi-W scheme,
whereas the fall in bandwidth is negligible for the SGRS scheme. There are two
reasons. First, the network utilization becomes lower when the block size
decreases (i.e., the number of blocks increases) in the Multi-W scheme.
However, in the SGRS scheme, the multiple blocks in one send or receive
descriptor are considered as one message. Second, the total startup cost in the
Multi-W scheme increases with the number of blocks because each block is
treated as an individual message in the Multi-W scheme and hence a startup cost
is associated with each block. From these two examples, it can be observed that
the SGRS scheme can overcome the two drawbacks in the Multi-W scheme by
increasing network utilization and reducing startup costs. These potential
benefits motivate us to design MPI datatype communication using the SGRS scheme
described in detail in Section 4.

Fig. 1. Bandwidth Comparison over VAPI with 64 Blocks.

Fig. 2. Bandwidth Comparison over VAPI with Different Number of Blocks.



4 Proposed SGRS (Send Gather/Recv Scatter) Approach 

In this section we first describe the SGRS scheme. Then we discuss the design
and implementation issues and finally look at some optimizations to this scheme.

Fig. 3. Basic Idea of the SGRS Scheme.

Fig. 4. SGRS Protocol.

4.1 Basic Idea

The basic idea behind the SGRS scheme is to use the scatter/gather feature
associated with the send/receive mechanism to achieve zero-copy communication.
Using this feature we can send/receive multiple data blocks as a single message
by posting a send gather descriptor at the source and a receive scatter
descriptor at the destination. Figure 3 illustrates this approach. The SGRS
scheme can handle non-contiguity on both sides. As mentioned in Section 2, RDMA
Write Gather or RDMA Read Scatter handles non-contiguity only on one side.
Hence, to achieve zero-copy datatype communication based on RDMA operations,
the Multi-W scheme is needed [17].

4.2 Design and Implementation Issues 

Communication Protocol. The SGRS scheme is deployed in the Rendezvous
protocol to transfer large datatype messages. For small datatype messages, the
Generic scheme is used. As shown in Figure 4, the sender first sends out the
Rendezvous start message with the data layout information. Second, the receiver
receives this message and figures out how to match the sender's layout with
its own layout. Then, the receiver sends the layout matching decision to the
sender. After receiving the reply message, the sender posts send gather
descriptors. It is possible that the sender may break one block into multiple
blocks to meet the layout matching decision. There are four main design issues:
Secondary connection, Layout exchange, Posting descriptors and Registration.

Secondary Connection. The SGRS scheme needs a second connection to 
transmit the non-contiguous data. This need arises because it is possible in 
the existing MVAPICH design to prepost some receive descriptors on the main 
connection as a part of its flow control mechanism. These descriptors could 
unwittingly match with the gather-scatter descriptors associated with the non- 
contiguous transfer. One possible issue with the extra connection is scalability. 
In our design, there are no buffers/resources for the second connection. The HCA 
usually can support a large number of connections. Hence the extra connection 
does not hurt the scalability. 

Layout Exchange. The MPI datatype has only local semantics. To enable 
zero-copy communication, both sides should have an agreement on how to send 
and receive data. In our design, the sender first sends its layout information to 
the receiver in the Rendezvous start message as shown in Figure 4. Then the 
receiver finds a solution to match these layouts. This decision information is also 
sent back to the sender for posting send gather descriptors. To reduce the over- 
head for transferring datatype layout information, a layout caching mechanism 
is desirable [7]. Implementation details of this cache mechanism in MVAPICH 
can be found in [17]. In Section 5, we evaluate the effectiveness of this cache 
mechanism. 

Posting Descriptors. There are three issues in posting descriptors. First, if the
number of blocks in the datatype message is larger than the maximum allowable
gather/scatter limit, the message has to be chopped into multiple gather/scatter
descriptors. Second, the number of posted send descriptors and the number of
posted receive descriptors must be equal. Third, for each pair of matched send
and receive descriptors, the data length must be the same. This basically needs
a negotiation phase. Both of these issues can be handled by taking advantage of
the Rendezvous start and reply messages in the Rendezvous protocol. In our
design, the receiver makes the matching decision taking into account the
layouts as well as the scatter/gather limit. Both the sender and the receiver
post their descriptors with the guidance of the matching decision.
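
The first of these issues, splitting a message whose block count exceeds the
gather/scatter limit, can be sketched as a simple planning routine. This is
only an illustration under the assumption of equally sized blocks; it does not
reproduce the matching logic of the MVAPICH implementation, and the function
name and parameters are hypothetical.

```c
#include <stddef.h>

/* Splits `nblocks` contiguous blocks into descriptors that hold at most
 * `max_sge` segments each. The segment count of descriptor i is written
 * to segs[i] for i < capacity. Returns the number of descriptors needed,
 * or 0 if max_sge is 0. */
size_t plan_descriptors(size_t nblocks, size_t max_sge,
                        size_t *segs, size_t capacity)
{
    size_t ndesc = 0;
    if (max_sge == 0)
        return 0;
    while (nblocks > 0) {
        size_t n = nblocks < max_sge ? nblocks : max_sge;
        if (ndesc < capacity)
            segs[ndesc] = n;
        ndesc++;
        nblocks -= n;
    }
    return ndesc;
}
```

If sender and receiver both apply the same plan to the agreed layout, they post
the same number of descriptors, and with equally sized blocks each matched pair
carries the same data length, which are the second and third conditions listed
above.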

User Buffer Registration. To send data from and receive data into user buffer 
directly, the user buffers need to be registered. Given a non-contiguous datatype 
we can register each contiguous block one by one. We could also register the 
whole region which covers all blocks and gaps between blocks. Both attempts 
have their drawbacks [16]. In [16], Optimistic Group Registration (OGR) has been 
proposed to make a tradeoff between the number of registration and deregistra- 
tion operations and the total size of registered space to achieve efficient memory 
registration on datatype message buffers. 

5 Performance Evaluation 

In this section we evaluate the performance of our SGRS scheme with the Multi- 
W zero copy scheme as well as the generic scheme in MVAPICH. We do latency, 
bandwidth and CPU overhead tests using a vector datatype to demonstrate 
the effectiveness of our scheme. Then we show the potential benefits that can be 
observed for collective communication such as MPI_Alltoall that are built on top 
of point to point communication. Further we investigate the impact of layout 
caching for our design. 

5.1 Experimental Testbed 

The testbed is a cluster of 8 SuperMicro SUPER X5DL8-GG nodes, each with dual
Intel Xeon 3.0 GHz processors, 512 KB L2 cache, and a PCI-X 64-bit 133 MHz bus,
connected to Mellanox InfiniHost MT23108 DualPort 4x HCAs. The nodes are
connected using the Mellanox InfiniScale 24 port switch MTS 2400. The kernel
version used is Linux 2.4.22smp. The InfiniHost SDK version is 3.0.1 and the
HCA firmware version is 3.0.1. The Front Side Bus (FSB) runs at 533 MHz. The
physical memory is 1 GB of PC2100 DDR-SDRAM.

5.2 Vector Latency and Bandwidth Tests 

In this benchmark, an increasing number of columns in a two-dimensional
M x 4096 integer array is transferred between two processes. These columns can
be represented by a vector datatype. Figure 5 compares the ping-pong latency in
the MPI implementation using the two zero-copy schemes. We set up two cases for
the number of rows (M) in this array: one is 64 and one is 128. The number
of columns varies from 4 to 2048; the corresponding message size varies from 2
KBytes to 512 KBytes. We also compare it with the latency of the contiguous
transfer, which serves as the lower bound. We observe that the SGRS scheme
reduces the latency by up to 61% compared to that of the Multi-W scheme.
Figure 6 shows the bandwidth results. The improvement factor over the Multi-W
scheme varies from 1.12 to 4.0.

In both latency and bandwidth tests, it can also be observed that when the
block size is smaller, the improvement of the SGRS scheme over the Multi-W
scheme is higher. This is because the improved network utilization in the SGRS
scheme is more significant when the block size is small. When the block size
is large enough, RDMA operations on each block can achieve good network
utilization as well, and both schemes perform comparably. Compared to the
Generic scheme, the latency results of the SGRS scheme are better in cases when
the block size is larger than 512 bytes. When the message size is small and the
block size is small, the Generic scheme performs the best. This is because the
memory copy cost is not substantial and the Generic scheme can achieve better
network utilization. The bandwidth results of the SGRS scheme are always better
than those of the Generic scheme.

Fig. 5. MPI Level Vector Latency.

Fig. 6. MPI Level Vector Bandwidth.



5.3 CPU Overhead Tests 

In this section we measure the CPU overhead involved for the two schemes.
Figures 7 and 8 compare the CPU overheads at the sender side and the
receiver side, respectively. The SGRS scheme has lower CPU involvement on the
sender side than the Multi-W scheme. On the receiver side, however, the
SGRS scheme incurs an additional overhead, whereas the overhead is practically
zero in the case of the Multi-W scheme.



5.4 Performance of MPI_Alltoall 

Collective datatype communication can benefit from the high-performance
point-to-point datatype communication provided in our implementation. We
designed a test to evaluate MPI_Alltoall performance with derived datatypes. We
use the same vector datatype that we used for our earlier evaluation.

Fig. 7. Sender side CPU overhead.

Fig. 8. Receiver side CPU overhead.

Fig. 9. MPI_Alltoall Latency.

Fig. 10. Overhead of Transferring Layout Information.

Figure 9 shows the MPI_Alltoall latency performance of the various schemes
on 8 nodes. We study the Alltoall latency over the message range 4K-512K.
We ran these experiments for two different numbers of blocks: 64 and 128. We
observe that the SGRS scheme outperforms the Multi-W scheme consistently.
The gap widens as the number of blocks increases. This is because the startup
costs in the Multi-W scheme increase with the number of blocks.
In addition, given a message size, the network utilization decreases with the
increase of the number of blocks in the Multi-W scheme.

5.5 Impact of Layout Caching 

In both the Multi-W and SGRS schemes, the layout has to be exchanged between
the sender and receiver before data communication. In this test, we studied
the overhead of transferring the layout information. We consider a synthetic
benchmark where this effect might be prominent. In our benchmark, we need
to transfer the two leading diagonals of a square matrix between two processes.
These diagonal elements are actually small blocks rather than single elements.
Hence, the layout information is complex and we need a considerable layout size
to describe it. As the size of the matrix increases, the number of
non-contiguous blocks increases correspondingly, and so does the layout
description. Figure 10 shows the percentage of overhead that is incurred in
transferring this layout information when there is no layout cache as compared
with the case that has a layout cache. For smaller message sizes, we can see a
benefit of 10 percent, and this keeps diminishing as the message size
increases. Another aspect here is that even though for small messages the
layout size is comparable with the message size, the layout is transferred in a
contiguous manner, so it takes a smaller fraction of the time to transfer than
a non-contiguous message of comparable size. Since the cost associated with
maintaining this cache is virtually zero, for message sizes in this range we
can benefit from layout caching.

6 Related Work 

Many researchers have been working on improving MPI datatype communication.
Research on datatype processing systems includes [5,13]. Research on
optimizing packing and unpacking procedures includes [2,5]. The closest work
to ours is the work [17, 14] on taking advantage of network features to improve
noncontiguous data communication. In [14], the use of InfiniBand features to
transfer non-contiguous data is discussed in the context of ARMCI, which is a
one-sided communication library. In [17], Wu et al. have systematically studied
two main types of approach for MPI datatype communication (Pack/Unpack-based
approaches and Copy-Reduced approaches) over InfiniBand. The Multi-W
scheme has been proposed to achieve zero-copy datatype communication.

7 Conclusions and Future Work 

In this paper we presented a new zero-copy scheme to efficiently implement
datatype communication over InfiniBand. The proposed scheme, SGRS, leverages
the Send Gather/Recv Scatter feature of InfiniBand to improve the datatype
communication performance. The experimental results we achieved show that
this scheme outperforms the existing Multi-W zero-copy scheme in all cases for
both point-to-point and collective operations. Compared to the Generic
scheme, for many cases, the SGRS scheme reduces the latency by 62% and increases
the bandwidth by 400%. In the cases where the total datatype message size
is small and the contiguous block sizes are relatively small, packing/unpacking
based schemes [17] perform better. But beyond a particular "cutoff" point, the
zero-copy scheme performs better. The SGRS scheme pushes this cutoff point
to a relatively smaller value compared to the Multi-W scheme. As part of future
work, we would like to compare this scheme with other schemes and evaluate
this scheme at the application level. A combination of this scheme with other
schemes can be incorporated to choose the best scheme for a given datatype
message adaptively.

Abstract. The one-sided communication operations in MPI are
intended to provide the convenience of directly accessing remote mem- 
ory and the potential for higher performance than regular point-to-point 
communication. Our performance measurements with three MPI imple- 
mentations (IBM MPI, Sun MPI, and LAM) indicate, however, that one- 
sided communication can perform much worse than point-to-point com- 
munication if the associated synchronization calls are not implemented 
efficiently. In this paper, we describe our efforts to minimize the overhead 
of synchronization in our implementation of one-sided communication in 
MPICH-2. We describe our optimizations for all three synchronization 
mechanisms defined in MPI: fence, post-start-complete-wait, and lock- 
unlock. Our performance results demonstrate that, for short messages, 
MPICH-2 performs six times faster than LAM for fence synchronization 
and 50% faster for post-start-complete-wait synchronization, and it per- 
forms more than twice as fast as Sun MPI for all three synchronization 
methods. 



1 Introduction 

MPI defines one-sided communication operations that allow users to directly 
access the memory of a remote process [9]. One-sided communication is both
convenient to use and has the potential to deliver higher performance than
regular point-to-point (two-sided) communication, particularly on networks that
support one-sided communication natively, such as InfiniBand and Myrinet. On 
networks that support only two-sided communication, such as TCP, it is harder 
for one-sided communication to do better than point-to-point communication. 
Nonetheless, a good implementation should strive to deliver performance as close 
as possible to that of point-to-point communication. 

One-sided communication in MPI requires the use of one of three synchro- 
nization mechanisms: fence, post-start-complete-wait, or lock-unlock. The syn- 
chronization mechanism defines the time at which the user can initiate one-sided 
communication and the time when the operations are guaranteed to be com- 
pleted. The true cost of one-sided communication, therefore, must include the 
time taken for synchronization. An unoptimized implementation of the synchro- 
nization functions may perform more communication and synchronization than 
necessary (such as a barrier), which can adversely affect performance, particu- 
larly for short and medium-sized messages. 

We measured the performance of three MPI implementations, IBM MPI, 
Sun MPI, and LAM [8], for a test program that performs nearest-neighbor ghost- 
area exchange, a communication pattern common in many scientific applications 
such as PDE simulations. We wrote four versions of this program: using point- 
to-point communication (isend/irecv) and using one-sided communication with 
fence, post-start-complete-wait, and lock-unlock synchronization. We measured 
the time taken for a single communication step (each process exchanges data 
with its four neighbors) by doing the step a number of times and calculating 
the average. Figure 1 shows a snippet of the fence version of the program, and 
Figure 2 shows the performance results. 

for (i=0; i<ntimes; i++) { 
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win); 
    for (j=0; j<nbrs; j++) { 
        MPI_Put(sbuf + j*n, n, MPI_INT, nbr[j], j, n, MPI_INT, win); 
    } 
    MPI_Win_fence(MPI_MODE_NOSTORE | MPI_MODE_NOPUT | MPI_MODE_NOSUCCEED, win); 
} 

Fig. 1. Fence version of the test. 



With IBM MPI on an SP, one-sided communication is almost two orders of 
magnitude slower than point-to-point (pt2pt) for short messages and remains 
significantly slower until messages get larger than 256 KB. With Sun MPI on a 
shared-memory SMP, all three one-sided versions are about six times slower than 
the point-to-point version for short messages. With LAM on a Linux cluster con- 
nected with fast ethernet, for short messages, post-start-complete-wait (pscw) is 
about three times slower than point-to-point, and fence is about 18 times slower 
than point-to-point.¹ As shown in Figure 1, we pass appropriate assert values 
to MPI_Win_fence so that the MPI implementation can optimize the function. 
Since LAM does not support asserts, we commented them out when using LAM. 
At least some of the poor performance of LAM with fence can be attributed to 
not taking advantage of asserts. 

We observed similar results for runs with different numbers of processes on 
all three implementations. Clearly, the overhead associated with synchroniza- 
tion significantly affects the performance of these implementations. Other re- 
searchers [4] have found similarly high overheads in their experiments with four 
MPI implementations: NEC, Hitachi, Sun, and LAM. 

Our goal in the design and implementation of one-sided communication in our 
MPI implementation, MPICH-2, has been to minimize the amount of additional 
communication and synchronization needed to implement the semantics defined 
by the synchronization functions. We particularly avoid using a barrier anywhere. 

¹ LAM does not support lock-unlock synchronization. 
Fig. 2. Performance of IBM MPI, Sun MPI, and LAM for a nearest-neighbor ghost- 
area exchange test. 



As a result, we are able to achieve much higher performance than do other MPI 
implementations. We describe our optimizations and our implementation in this 
paper. 

2 Related Work 

One-sided communication as a programming paradigm was made popular ini- 
tially by the SHMEM library on the Cray T3D and T3E [6], the BSP library [5], 
and the Global Arrays library [12]. After the MPI-2 Forum defined an interface 
for one-sided communication in MPI, several vendors and a few research groups 
implemented it, but, as far as we know, none of these implementations specifi- 
cally optimizes the synchronization overhead. For example, the implementations 
of one-sided communication for Sun MPI by Booth and Mourão [3] and for the 
NEC SX-5 by Träff et al. [13] use a barrier to implement fence synchroniza- 
tion. Other efforts at implementing MPI one-sided communication include the 
implementation for InfiniBand networks by Jiang et al. [7], for a Windows imple- 
mentation of MPI (WMPI) by Mourão and Silva [11], for the Fujitsu VPP5000 
vector machine by Asai et al. [1], and for the SCI interconnect by Worringen 
et al. [14]. Mourão and Booth [10] describe issues in implementing one-sided 
communication in an MPI implementation that uses multiple protocols, such as 
TCP and shared memory. 

3 One-Sided Communication in MPI 

In MPI, the memory that a process allows other processes to access via one- 
sided communication is called a window. Processes specify their local windows 
to other processes by calling the collective function MPI_Win_create. The three 
functions for one-sided communication are MPI_Put (remote write), MPI_Get (re- 
mote read), and MPI_Accumulate (remote update). They are nonblocking func- 
tions: They initiate but do not necessarily complete the one-sided operation. These 
three functions are not sufficient by themselves because one needs to know when 
a one-sided operation can be initiated (that is, when the remote memory is ready 
to be read or written) and when a one-sided operation is guaranteed to be com- 
pleted. To specify these semantics, MPI defines three different synchronization 
mechanisms. 

Process 0               Process 1
MPI_Win_fence(win)      MPI_Win_fence(win)
MPI_Put(1)              MPI_Put(0)
MPI_Get(1)              MPI_Get(0)
MPI_Win_fence(win)      MPI_Win_fence(win)

a. Fence synchronization 

Process 0               Process 1               Process 2
                        MPI_Win_post(0,2)
MPI_Win_start(1)                                MPI_Win_start(1)
MPI_Put(1)                                      MPI_Put(1)
MPI_Get(1)                                      MPI_Get(1)
MPI_Win_complete(1)                             MPI_Win_complete(1)
                        MPI_Win_wait(0,2)

b. Post-start-complete-wait synchronization 

Process 0               Process 1               Process 2
MPI_Win_create(&win)    MPI_Win_create(&win)    MPI_Win_create(&win)
MPI_Win_lock(shared,1)                          MPI_Win_lock(shared,1)
MPI_Put(1)                                      MPI_Put(1)
MPI_Get(1)                                      MPI_Get(1)
MPI_Win_unlock(1)                               MPI_Win_unlock(1)
MPI_Win_free(&win)      MPI_Win_free(&win)      MPI_Win_free(&win)

c. Lock-unlock synchronization 

Fig. 3. The three synchronization mechanisms for one-sided communication in MPI. 

Fence. Figure 3a illustrates the fence method of synchronization (without the 
syntax). MPI_Win_fence is collective over the communicator associated with the 
window object. A process may issue one-sided operations after the first call to 
MPI_Win_fence returns. The next fence completes any one-sided operations that 
this process issued after the preceding fence, as well as the one-sided operations 
other processes issued that had this process as the target. The drawback of the 
fence method is that if only small subsets of processes are actually communi- 
cating with each other, the collectiveness of the fence function over the entire 
communicator results in unnecessary synchronization overhead. 

Post-Start-Complete-Wait. To avoid the drawback of fence, MPI defines a 
second mode of synchronization in which only subsets of processes need to syn- 
chronize, as shown in Figure 3b. A process that wishes to expose its local window 
to remote accesses calls MPI_Win_post, which takes as argument an MPI_Group 
object that specifies the set of processes that will access the window. A process 
that wishes to perform one-sided communication calls MPI_Win_start, which 
also takes as argument an MPI_Group object that specifies the set of processes 
that will be the target of one-sided operations from this process. After issuing 
all the one-sided operations, the origin process calls MPI_Win_complete to com- 
plete the operations at the origin. The target calls MPI_Win_wait to complete 
the operations at the target. 

Lock-Unlock. In this synchronization method, the origin process calls 
MPI_Win_lock to obtain either shared or exclusive access to the window on the 
target, as shown in Figure 3c. After issuing the one-sided operations, it calls 
MPI_Win_unlock. The target does not make any synchronization call. When 
MPI_Win_unlock returns, the one-sided operations are guaranteed to be com- 
pleted at the origin and the target. MPI_Win_lock is not required to block until 
the lock is acquired, except when the origin and target are one and the same 
process. 

4 Implementing MPI One-Sided Communication 

Our current implementation of one-sided communication in MPICH-2 is layered 
on the same lower-level communication abstraction we use for point-to-point 
communication, called CH3 [2]. CH3 uses a two-sided communication model 
in which the sending side sends packets followed optionally by data, and the 
receiving side explicitly posts receives for packets and, optionally, data. The 
content and interpretation of the packets are decided by the upper layers. We 
have simply added new packet types for one-sided communication. So far, CH3 
has been implemented on top of TCP and shared memory, and therefore our 
implementation of one-sided communication runs on TCP and shared memory. 
Implementations of CH3 on other networks are in progress. 

For all three synchronization methods, we do almost nothing in the first syn- 
chronization call; do nothing in the calls to put, get, or accumulate other than 
queuing up the requests locally; and instead do everything in the second syn- 
chronization call. This approach allows the first synchronization call to return 
immediately without blocking, reduces or eliminates the need for extra commu- 
nication in the second synchronization call, and offers the potential for commu- 
nication operations to be aggregated and scheduled efficiently as in BSP [5]. We 
describe our implementation below. 



4.1 Fence 

An implementation of fence synchronization must take into account the following 
semantics: A one-sided operation cannot access a process’s window until that 
process has called fence, and the next fence on a process cannot return until all 
processes that need to access that process’s window have completed doing so. 

A naive implementation of fence synchronization could be as follows. At the 
first fence, all processes do a barrier so that everyone knows that everyone else has 
called fence. Puts, gets, and accumulates can be implemented either as blocking 
or nonblocking operations. In the second fence, after all the one-sided operations 
have been completed, all processes again do a barrier to ensure that no process 
leaves the fence before other processes have finished accessing its window. This 
method requires two barriers, which can be quite expensive. 

In our implementation, we avoid the two barriers completely. In the first 
call to fence, we do nothing. For the puts, gets, and accumulates that follow, 
we simply queue them up locally and do nothing else, with the exception that 
any one-sided operation whose target is the origin process itself is performed 
immediately by doing a simple memory copy or local accumulate. In the second 
fence, each process goes through its list of queued one-sided operations and 
determines, for every other process i, whether any of the one-sided operations 
have i as the target. This information is stored in an array, such that a 1 in 
the ith location of the array means that one or more one-sided operations are 
targeted to process i, and a 0 means no one-sided operations are targeted to that 
process. All processes now do a reduce-scatter sum operation on this array (as in 
MPI_Reduce_scatter). As a result, each process now knows how many processes 
will be performing one-sided operations on its window, and this number is stored 
in a counter in the MPI_Win object. Each process is now free to perform the data 
transfer for its one-sided operations; it needs only to ensure that the window 
counter at the target gets decremented when all the one-sided operations from 
this process to that target have been completed. 
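
As an illustration only, the counting step described above can be simulated in plain C without MPI. The sketch assumes a fixed upper bound on the number of processes and uses hypothetical names (`queued_op`, `build_target_flags`, `reduce_scatter_sum`, `MAX_PROCS`) that do not come from MPICH-2. The serial loop stands in for the collective reduce-scatter: element i of the element-wise sum is what rank i would receive.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_PROCS = 64 };   /* hypothetical upper bound for this sketch */

/* One locally queued one-sided operation; only the target rank matters here. */
typedef struct {
    int target;
} queued_op;

/* Per-process step: flags[i] becomes 1 if any queued operation targets
 * rank i, else 0. Operations targeting the origin itself are skipped,
 * since the text says they are performed immediately and not counted. */
void build_target_flags(const queued_op *ops, size_t nops,
                        int nprocs, int self, int *flags)
{
    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    for (int i = 0; i < nprocs; i++)
        flags[i] = 0;
    for (size_t k = 0; k < nops; k++) {
        int t = ops[k].target;
        if (t >= 0 && t < nprocs && t != self)
            flags[t] = 1;
    }
}

/* Collective step, simulated serially: counters[i] is the sum over all
 * ranks p of flags[p][i], i.e. the number of distinct origins that will
 * access the window of rank i. */
void reduce_scatter_sum(int nprocs, int flags[][MAX_PROCS], int *counters)
{
    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    for (int i = 0; i < nprocs; i++) {
        counters[i] = 0;
        for (int p = 0; p < nprocs; p++)
            counters[i] += flags[p][i];
    }
}
```

Because each origin contributes at most 1 per target, the resulting counter is a number of origin processes, not a number of operations, which is why a single decrement per origin suffices.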

A put is performed by sending a put packet containing the address, count, 
and datatype information for the target. If the datatype is a derived datatype, an 
encapsulated version of the derived datatype is sent next. Then follows the actual 
data. The MPI progress engine on the target receives the packet and derived 
datatype, if any, and then directly receives the data into the correct memory 
locations. No rendezvous protocol is needed for the data transfer, because the 
origin has already been authorized to write to the target window. Gets and 
accumulates are implemented similarly. 

For the last one-sided operation, the origin process sets a field in the packet 
header indicating that it is the last operation. The target therefore knows to 
decrement its window counter after this operation has completed at the target. 
When the counter reaches 0, it indicates that all remote processes that need 
to access the target’s window have completed their operations, and the target 
can therefore return from the second fence. This scheme of decrementing the 
counter only on the last operation assumes that data delivery is ordered, which 
is a valid assumption for the networks we currently support. On networks that 
do not guarantee ordered delivery, a simple sequence-numbering scheme can be 
added to achieve the same effect. 
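
The target-side bookkeeping can be sketched as follows, assuming ordered delivery as stated above. The packet layout and the names `rma_packet`, `window_state`, and `target_op_completed` are hypothetical simplifications for illustration, not the actual CH3 packet format.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified packet header. */
typedef struct {
    int  origin;    /* rank that issued the operation */
    bool is_last;   /* set by the origin on its final operation to this target */
} rma_packet;

/* Hypothetical per-window state at the target. */
typedef struct {
    int pending_origins;   /* counter obtained from the reduce-scatter */
} window_state;

/* Called after the operation carried by pkt has completed at the target.
 * Only an operation flagged as last decrements the counter. Returns true
 * when no origin is outstanding, i.e. when the second fence may return. */
bool target_op_completed(window_state *win, const rma_packet *pkt)
{
    assert(win->pending_origins >= 0);
    if (pkt->is_last && win->pending_origins > 0)
        win->pending_origins--;
    return win->pending_origins == 0;
}
```

With a counter of 2, a non-final packet leaves the counter unchanged, and the fence can return only after one final packet from each of the two origins has been processed.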

We have thus eliminated the need for a barrier in the first fence and replaced 
the barrier at the end of the second fence by a reduce-scatter at the beginning of 
the second fence before any data transfer. After that, all processes can do their 
communication independently and return when they are done. 

4.2 Post-Start-Complete-Wait 

An implementation of post-start-complete-wait synchronization must take into 
account the following semantics: A one-sided operation cannot access a process’s 
window until that process has called MPI_Win_post, and a process cannot return 
from MPI_Win_wait until all processes that need to access that process’s window 
have completed doing so and called MPI_Win_complete. 

A naive implementation of this synchronization could be as follows. 
MPI_Win_start blocks until it receives a message from all processes in the target group 
indicating that they have called MPI_Win_post. Puts, gets, and accumulates can 
be implemented as either blocking or nonblocking functions. MPI_Win_complete 
waits until all one-sided operations initiated by that process have completed 
locally and then sends a done message to each target process. MPI_Win_wait on 
the target blocks until it receives the done message from each origin process. 
Clearly, this method involves a great deal of synchronization. 

We have eliminated most of this synchronization in our implementation as 
follows. In MPI_Win_post, if the assert MPI_MODE_NOCHECK is not specified, the 
process sends a zero-byte message to each process in the origin group to in- 
dicate that MPI_Win_post has been called. It also sets the counter in the win- 
dow object to the size of this group. As in the fence case, this counter will get 
decremented by the completion of the last one-sided operation from each origin 
process. MPI_Win_wait simply blocks and invokes the progress engine until this 
counter reaches zero. 

On the origin side, we do nothing in MPI_Win_start. All the one-sided op- 
erations following MPI_Win_start are simply queued up locally as in the fence 
case. In MPI_Win_complete, the process first waits to receive the zero-byte mes- 
sages from the processes in the target group. It then performs all the one-sided 
operations exactly as in the fence case. The last one-sided operation has a field 
set in its packet that causes the target to decrement its counter on completion 
of the operation. If an origin process has no one-sided operations destined to a 
target that was part of the group passed to MPI_Win_start, it still needs to send 
a packet to that target for decrementing the target’s counter. MPI_Win_complete 
returns when all its operations have locally completed. 
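
The origin-side rule in MPI_Win_complete (flag the last operation to each target, and send an extra packet to any target in the start group that received no operations) can be sketched in plain C. The names `pscw_packet` and `plan_complete` are hypothetical, and the sketch groups packets by target for simplicity; the order in which MPICH-2 actually issues them is not specified in the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one packet the origin will send. */
typedef struct {
    int  target;
    bool is_last;       /* target decrements its counter on completion */
    bool carries_data;  /* false for the extra packet sent when no ops exist */
} pscw_packet;

/* Plans the packets for MPI_Win_complete. ops_target[k] is the target rank
 * of queued operation k; group lists the ranks passed to MPI_Win_start.
 * out must have room for nops + ngroup entries. Returns the packet count. */
size_t plan_complete(const int *ops_target, size_t nops,
                     const int *group, size_t ngroup, pscw_packet *out)
{
    size_t n = 0;
    assert(out != NULL);
    for (size_t g = 0; g < ngroup; g++) {
        bool found = false;
        for (size_t k = 0; k < nops; k++) {
            if (ops_target[k] != group[g])
                continue;
            out[n].target = group[g];
            out[n].is_last = false;
            out[n].carries_data = true;
            n++;
            found = true;
        }
        if (found) {
            /* The final operation to this target carries the flag. */
            out[n - 1].is_last = true;
        } else {
            /* No operations for this target: still decrement its counter. */
            out[n].target = group[g];
            out[n].is_last = true;
            out[n].carries_data = false;
            n++;
        }
    }
    return n;
}
```

Every target in the start group therefore receives exactly one packet with the flag set, which matches the counter that MPI_Win_post initialized to the size of the origin group.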

Thus the only synchronization in this implementation is the wait at the 
beginning of MPI_Win_complete for a zero-byte message from the processes in 
the target group, and this too can be eliminated if the user specifies the assert 
MPI_MODE_NOCHECK to MPI_Win_post and MPI_Win_start (similar to MPI_Rsend). 

4.3 Lock-Unlock 

Implementing lock-unlock synchronization when the window memory is not di- 
rectly accessible by all origin processes requires the use of an asynchronous agent 
at the target to cause progress to occur, because one cannot assume that the 
user program at the target will call any MPI functions that will cause progress 
periodically. 

Our design for the implementation of lock-unlock synchronization involves 
the use of a thread that periodically wakes up and invokes the MPI progress en- 
gine if it finds that no other MPI function has invoked the progress engine within 
some time interval. If the progress engine had been invoked by other calls to MPI, 
the thread does nothing. This thread is created only when MPI_Win_create is 







R. Thakur, W. Gropp, and B. Toonen 



called and if the user did not pass an info object to MPI_Win_create with the key 
no_locks set to true (indicating that lock-unlock synchronization will not be 
used). In MPI_Win_lock, we do nothing but queue up the lock request locally 
and return immediately. The one-sided operations are also queued up locally. 
All the work is done in MPI_Win_unlock. 

For the general case where there are multiple one-sided operations, we imple- 
ment MPI_Win_unlock as follows. The origin sends a “lock-request” packet to the 
target and waits for a “lock-granted” reply. When the target receives the lock 
request, it either grants the lock by sending a lock-granted reply to the origin 
or queues up the lock request if it conflicts with the existing lock on the win- 
dow. When the origin receives the lock-granted reply, it performs the one-sided 
operations exactly as in the other synchronization modes. The last one-sided op- 
eration, indicated by a field in the packet header, causes the target to release the 
lock on the window after the operation has completed. Therefore, no separate 
unlock request needs to be sent from origin to target. 

The semantics specify that MPI_Win_unlock cannot return until the one-sided 
operations are completed at both origin and target. Therefore, if the lock is a 
shared lock and none of the operations is a get, the target sends an acknowl- 
edgment to the origin after the last operation has completed. If any one of the 
operations is a get, we reorder the operations and perform the get last. Since 
the origin must wait to receive data, no additional acknowledgment is needed. 
This approach assumes that data transfer in the network is ordered. If not, an 
acknowledgment is needed even if the last operation is a get. If the lock is an 
exclusive lock, no acknowledgment is needed even if none of the operations is 
a get, because the exclusive lock on the window prevents another process from 
accessing the data before the operations have completed. 
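
The acknowledgment rules in this paragraph reduce to a small decision function, sketched below in plain C. The function name and parameters are hypothetical. The text does not discuss an exclusive lock on an unordered network, so the sketch simply follows the stated rule that an exclusive lock needs no acknowledgment; that case should be treated as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if the target must send an explicit acknowledgment before
 * MPI_Win_unlock can return at the origin (multiple-operation case). */
bool unlock_needs_ack(bool exclusive_lock, bool any_get, bool ordered_network)
{
    /* Exclusive lock: no other process can access the data early. */
    if (exclusive_lock)
        return false;
    /* Shared lock with a get performed last on an ordered network:
     * the returned data itself signals completion. */
    if (any_get && ordered_network)
        return false;
    /* Shared lock otherwise: an explicit acknowledgment is required. */
    return true;
}

/* Truth table for the cases named in the text. */
void unlock_needs_ack_selfcheck(void)
{
    assert(unlock_needs_ack(false, false, true));    /* shared, no get */
    assert(!unlock_needs_ack(false, true, true));    /* shared, get last */
    assert(unlock_needs_ack(false, true, false));    /* shared, unordered */
    assert(!unlock_needs_ack(true, false, true));    /* exclusive */
}
```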

Optimization for Single Operations. If the lock-unlock is for a single short 
operation and predefined datatype at the target, we send the put/accumulate 
data or get information along with the lock-request packet itself. If the target 
can grant the lock, it performs the specified operation right away. If not, it 
queues up the lock request along with the data or information and performs the 
operation when the lock can be granted. Except in the case of get operations, 
MPI_Win_unlock blocks until it receives an acknowledgment from the target that 
the operation has completed. This acknowledgment is needed even if the lock is 
an exclusive lock because the origin does not know whether the lock has been 
granted. 
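
The conditions for this single-operation path can be written down as two predicates, sketched below in plain C. The type `rma_op`, the function names, and in particular the `short_limit` threshold are hypothetical: the text says only that the operation must be "short" and does not give a size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { OP_PUT, OP_GET, OP_ACCUMULATE } op_kind;

/* Hypothetical description of one queued one-sided operation. */
typedef struct {
    op_kind kind;
    size_t  nbytes;               /* size of the data involved */
    bool    predefined_datatype;  /* target datatype is predefined */
} rma_op;

/* True if the queued operations qualify for being sent together with the
 * lock-request packet: exactly one operation, short, predefined datatype.
 * short_limit is an assumed tuning parameter, not a value from the text. */
bool can_piggyback(const rma_op *ops, size_t nops, size_t short_limit)
{
    assert(nops == 0 || ops != NULL);
    return nops == 1 &&
           ops[0].nbytes <= short_limit &&
           ops[0].predefined_datatype;
}

/* In the piggybacked case, MPI_Win_unlock waits for an acknowledgment
 * for puts and accumulates; for a get, the returned data serves instead. */
bool piggyback_waits_for_ack(const rma_op *op)
{
    return op->kind != OP_GET;
}
```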

Similar optimizations are possible for multiple one-sided operations, but at 
the cost of additional queuing/buffering at the target. 

5 Performance Results 

To study the performance of our implementation, we use the same ghost-area 
exchange program described in Section 1. Figure 4 shows the performance of the 
test program with MPICH-2 on a Linux cluster with fast ethernet and on a Sun 




